
How quickly can you remove spaces from a string? - deafcalculus
http://lemire.me/blog/2017/01/20/how-quickly-can-you-remove-spaces-from-a-string/
======
jakobegger
Fun story: When I implemented the syntax highlighting feature in Sequel Pro, I
needed a way to determine the string length of UTF-8 encoded strings. I didn't
find a function available on macOS that worked directly on a char*, so I
googled and found a really simple UTF-8 strlen function. It was easy to
understand and very fast. I think it was this:
[http://canonical.org/~kragen/strlen-
utf8.html](http://canonical.org/~kragen/strlen-utf8.html)
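
For the curious, a one-pass UTF-8 counter of the kind that page describes can
be sketched like this (my reconstruction, not necessarily Kragen's exact
code): it simply skips continuation bytes, which all match the bit pattern
10xxxxxx.

```c
#include <stddef.h>

/* Counts UTF-8 code points by skipping continuation bytes, which all
 * match the bit pattern 10xxxxxx. One compare per byte, no tables. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```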

I committed my syntax highlighting code, and a few days later someone had
replaced the simple UTF8 strlen function with the really long vectorised
version from this page: [http://www.daemonology.net/blog/2008-06-05-faster-
utf8-strle...](http://www.daemonology.net/blog/2008-06-05-faster-
utf8-strlen.html)

But the funny thing is that the supposedly fast vectorised strlen was
optimised for very long strings. The benchmarks showed results for megabytes
of text. But we were measuring the length of tokens, usually only a few
characters long, so in most cases the new function was actually slower!

I was a very junior dev, didn't want to piss off anybody, and the strlen
wasn't in the hot path anyway, so I didn't say anything. But I was a bit sad
that my easy-to-read code was replaced by such a monstrosity.

What's my point? Before you go and use these functions in your code, profile
your code to see if it would actually affect performance.

------
dzdt
Somehow the article fails to mention that the speed will depend heavily on
the input string: both its size and the distribution of whitespace
characters.

Even assuming the question is for the limit of very long strings, the
distribution makes a huge difference. Natural English has spaces on average
every 5.1 characters [1], so using multi-character tests to speed up the case
of runs of 8 or more characters without a space will probably slow it down,
not speed it up!

[1][https://arxiv.org/pdf/1208.6109](https://arxiv.org/pdf/1208.6109)

~~~
nkurz
_using multi-character tests to speed up the case of runs of 8 or more
characters without a space will probably slow it down, not speed it up!_

Yes, although there is the option of combining the vectorized comparisons with
a branchless approach. I modified his code to do this, and got what seemed to
be a flat .40 cycles per character independent of input, which is about twice
the speed Daniel illustrated on the input with 3% spaces. I'm sure it can be
made even faster (what would the limiting factor be?), but I think this shows
that a multi-character approach is faster on all input than would be possible
with any single-character approach?

~~~
rjeli
Could you share your code?

~~~
nkurz
I'm working with the author, and it should be in his GitHub repository soon.
But if you want to reproduce before then, you should be able to just change
the line "if (mask16 == 0)" to "if (0)" in sse4_despace(), which is in
include/despace.h.

The idea is that rather than testing whether any spaces were found and taking
a "shortcut", we always do a lookup on mask16, shuffle the bytes according to
the result to remove the spaces, and advance the output pointer by 16 minus
the number of spaces found. This costs 2-3 cycles for completely spaceless
input, but saves the ~15 cycle branch misprediction penalty each time an
unexpected space is found.
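
For readers who want the gist without the intrinsics, here is the same
always-store, conditionally-advance idea in scalar form (a sketch of mine,
not the actual SSE code; the vectorized version applies this 16 bytes at a
time via the shuffle lookup):

```c
#include <stddef.h>

/* Branchless scalar despace: store every byte unconditionally, then
 * advance the output position only when the byte isn't whitespace.
 * No data-dependent branch, hence no misprediction penalty. */
size_t despace_branchless_scalar(char *s, size_t n)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        char c = s[i];
        s[pos] = c;                                   /* always store */
        pos += (c != ' ' && c != '\n' && c != '\r');  /* maybe advance */
    }
    return pos;
}
```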

------
computator
What strikes me is how many sites won't accept a login name, credit card
number, phone number, or other field with leading or trailing spaces. This for
something where the speed of the code doesn't matter at all. I can't think of
any explanation other than incredible laziness or incompetence for not
stripping off spaces.

You might not notice this if you use a password manager or browser autofill,
but it's a lot of sites, including companies like major airlines for example.

Never mind that -- it's just a miracle when you can enter a credit card number
formatted as 4123 4567 8901 2345 rather than squished together.

~~~
edblarney
You hit the issue in a roundabout way:

The very premise of the article - 'how quickly can we remove whitespace' -
is rooted in the intellectual foundations of CS. As a culture, we are
obsessive about 'performance'.

This is because back in the day it was always important - and even today,
'under the hood', it still is. And of course there are situations in which it
still matters (complex algorithms, the limitations of mobile devices).

But in reality - these things are never the issue.

The 'issue' is the pragmatic application of basic algorithms to do a number of
basic things elegantly, which together form the foundation of a good user
experience.

Yes - the issue of 'no spaces' in card numbers etc. is a clumsy thing, and
it's laziness by developers.

Also - things like 'performance' are objectively measurable, you can get cool
data for it etc..

A 'bad experience' is sometimes difficult to define.

~~~
open-source-ux
_As a culture, we are are obsessive about 'performance'._

My purely anecdotal impression is quite the opposite. Speed of delivery and
convenience for the _developer_ (not the end user) seem to be the norm.

Frameworks, scripting languages, browser-based desktop and web apps: none of
these have the characteristics of being small, nimble, lightweight, or
performant. They certainly make life easier for the developer. Whether users
get a 'good experience' out of the end result is open to debate.

------
dsp1234
_This code will work on all UTF-8 encoded strings… which is the bulk of the
strings found on the Internet if you consider that UTF-8 is a superset of
ASCII._

Note that the first version of the code (and possibly the rest) only works
because space, newline, and carriage return are in the 7-bit ASCII set, which
is included in UTF-8. The extended 8-bit ASCII sets are not, but they are
often included when people speak of ASCII. So, for example, if the request
were to remove all "Copyright Sign" symbols (U+00A9), it would not work
correctly. The UTF-8 encoding of this symbol is 0xC2 0xA9, but the code only
works on individual bytes, so it would remove the 0xA9 byte, leaving a 0xC2
byte followed by whatever byte came next. It would also hit other UTF-8
characters like "Greek Capital Letter Omega" (Ω, which is encoded in UTF-8 as
0xCE 0xA9).

tl;dr Only works for the 7-bit ASCII set, not the common extended 8-bit ASCII
sets
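
A tiny demo of the failure mode (`remove_byte` is my own illustration, in the
spirit of the article's first loop, not code from the article):

```c
#include <stddef.h>

/* Naive byte-wise removal of a single byte value. Safe when the target
 * is 7-bit ASCII, but removing a byte like 0xA9 shears the trailing
 * bytes off multi-byte UTF-8 sequences. */
size_t remove_byte(unsigned char *s, size_t n, unsigned char target)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] != target)
            s[pos++] = s[i];
    return pos;
}
```

Run on {0xC2, 0xA9, 'x', 0xCE, 0xA9} ("©xΩ"), removing 0xA9 leaves {0xC2,
'x', 0xCE}: two orphaned lead bytes and invalid UTF-8.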

~~~
ajross
It's sort of a semantic argument, but I've never once heard anyone use
"ASCII" to refer to anything other than the 7-bit standard. There is no
"extended 8-bit ASCII set". That's a blanket term (though in casual use, "code
page" is more typical) for any of the literally dozens of 8-bit character
encodings that overlap with ASCII in the bottom half of the encoding space.

Basically the blog post was correct and precise: it removes ASCII characters
from UTF-8. There is no elaboration needed.

Again, though, the post was about using SIMD primitives to optimize what looks
like a scalar problem.

~~~
Sharlin
You would be surprised. Referring to any of the most common 8-bit encodings as
"extended ASCII" or even just "ASCII" has been very common in my experience.

~~~
versteegen
Yeah. Most people (well, back in the DOS days anyway) who don't know what a
"7-bit encoding" is would refer to code pages as "ASCII". I'm sure I did that
hundreds of times myself (I didn't know better until I was what, 20?) and I've
seen it thousands of times, not exaggerating. Just look at how many
"[extended] ASCII games" there are.

~~~
mikekchar
I know a lot of DOS people used the term "extended ASCII" to mean specific
8-bit encodings whose lower 128 characters were ASCII, but what "extended
ASCII" meant depended quite a lot on the context. I came from a different
background and it was very clear to me that ASCII was a 7-bit encoding that
was almost always stored in an 8-bit byte (I can only think of a few instances
in my whole life where I saw bitstreams of ASCII). So you could do whatever
you wanted with that last bit.

Having said that, I clearly remember arguing with people that encodings with
the eighth bit set were _not_ part of ASCII. As you say, there were many
people who didn't understand. My guess is that the GP never interacted with
those people. Especially if you were around pre-DOS I think it would be easy
to do.

------
nogracias
6.5MB for the tables to drive this faster de-spacer? No thank you. It would
blow out the L1 cache and make your program slower.

[https://raw.githubusercontent.com/lemire/despacer/master/inc...](https://raw.githubusercontent.com/lemire/despacer/master/include/despacer_tables.h)

~~~
jnordwick
My pet peeve: using table lookups so a benchmark shows faster results when in
reality your L1 cache is going to be stomped on. Not only will you be waiting
on the L1 cache to repopulate, but you also evict all the useful data.

~~~
exDM69
I've seen this elsewhere too. For a relatively mundane task (I think it was
Morton code conversion), a giant lookup table was constructed and it gave a
nice 2x or 4x performance improvement.

But no application will ever be doing only string despacing or Morton codes,
so the "fast" lookup table algorithm will make everything else slower by
evicting good cache lines. And once something else runs and evicts the lookup
tables, the next run will be slow again.

------
et1337
Cached text-only version:
[http://webcache.googleusercontent.com/search?q=cache%3Ahttp%...](http://webcache.googleusercontent.com/search?q=cache%3Ahttp%3A%2F%2Flemire.me%2Fblog%2F2017%2F01%2F20%2Fhow-
quickly-can-you-remove-spaces-from-a-string%2F&strip=1)

------
brianpgordon
This reminded me of spray-json's JsonParser. It has a bit of Scala code to
seek past whitespace:

    
    
      @tailrec private def ws(): Unit =
        // fast test whether cursorChar is one of " \n\r\t"
        if (((1L << cursorChar) & ((cursorChar - 64) >> 31) & 0x100002600L) != 0L) { advance(); ws() }
    

[https://github.com/spray/spray-
json/blob/765c83248e0bbe867dd...](https://github.com/spray/spray-
json/blob/765c83248e0bbe867ddc9d479b4fe79493569a54/src/main/scala/spray/json/JsonParser.scala#L186)
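
To unpack the one-liner: 0x100002600L is a 64-bit bitset with bits 9 (\t), 10
(\n), 13 (\r) and 32 (space) set, and the `(cursorChar - 64) >> 31` term
arithmetic-shifts to all-ones exactly when the character is below 64, masking
everything else out. A C transliteration (my sketch, assuming the usual
two's-complement wrap and arithmetic right shift; the shift amount is clamped
because C, unlike the JVM, makes oversized shifts undefined):

```c
#include <stdint.h>

/* 0x100002600 has bits 9 ('\t'), 10 ('\n'), 13 ('\r'), 32 (' ') set.
 * (c - 64) >> 31 is all-ones iff c < 64, zero otherwise, so characters
 * >= 64 can never match (e.g. '`' == 96 would alias bit 32 otherwise). */
static int is_json_ws(uint32_t c)
{
    return ((1ULL << (c & 63))
            & (uint64_t)(int64_t)((int32_t)(c - 64) >> 31)
            & 0x100002600ULL) != 0;
}
```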

~~~
hota_mazi
The compiler needs to be extremely clever on @tailrec for not blowing up the
L1 cache on this code.

To me, this looks like a perfect example of a developer thinking they're very
smart while actually writing code that's worse than its naive counterpart (a
good summary of Scala, in my experience).

~~~
jnordwick
Can you please explain why it would blow up the L1 cache?

------
octo_t
So using 128-bit instructions would imply you had 'words' which were over 16
characters long on average, right?

The average (English) word is ~5 characters long, so most of the time you'd
be forced to check anyway.

~~~
drdrey
Not all strings are made of English text

~~~
sfrailsdev
you don't even have to go to non-English text; UTF-8 alone has things like
mathematical spaces and spaces of different em sizes.
[http://perldoc.perl.org/perlrecharclass.html](http://perldoc.perl.org/perlrecharclass.html)
has a reasonably good list of non-ASCII spaces, though I'm not sure where to
dig for locale-specific lists.

That said, while it may not be a solution for every case, it's a solution for
the common case and a starting point for other cases, and thus pretty nifty
and potentially useful.

------
mnarayan01
I feel like you have to at _least_ include non-breaking spaces if you're going
to say you're removing spaces from UTF-8 strings.

------
criddell
Is there a clever way to remove all the space characters?

[https://www.cs.tut.fi/~jkorpela/chars/spaces.html](https://www.cs.tut.fi/~jkorpela/chars/spaces.html)

~~~
pjscott
If you use the naive algorithm at the top of the page but iterate over UTF-8
code points rather than bytes -- which is straightforward, BTW -- you _will_
get some cleverness automatically in the compiler's implementation of the
switch statement. I wrote a switch to handle the Unicode "is it space?"
function, with the values from the table you linked, and compiled with "clang
-Os". You know what it did?

It generated a friggin' binary search tree! Some of the leaves were a sequence
of straight-line comparisons, because that's more compact, but the higher
levels were all a bunch of "if (c > 0x167F) { ... }" sort of code. At one
point it subtracts 8192 from something and then compares with 12, and I
_think_ this is because the x86 instruction encoding is shorter and the
compiler knows that the register won't be needed again along either of the
code paths from that point.

Compilers are amazing sometimes.
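
For reference, the kind of switch being described looks something like this
(my sketch; the case values are a subset of the spaces in the linked table,
and it's this flat shape that clang lowers to a binary search at -Os):

```c
#include <stdint.h>

/* "Is this code point a space?" as a flat switch; the compiler, not the
 * programmer, picks the search strategy (jump table, tree, or both). */
int is_unicode_space(uint32_t cp)
{
    switch (cp) {
    case 0x0020:  /* SPACE */
    case 0x00A0:  /* NO-BREAK SPACE */
    case 0x1680:  /* OGHAM SPACE MARK */
    case 0x2000: case 0x2001: case 0x2002: case 0x2003: case 0x2004:
    case 0x2005: case 0x2006: case 0x2007: case 0x2008: case 0x2009:
    case 0x200A:  /* EN QUAD .. HAIR SPACE */
    case 0x2028:  /* LINE SEPARATOR */
    case 0x2029:  /* PARAGRAPH SEPARATOR */
    case 0x202F:  /* NARROW NO-BREAK SPACE */
    case 0x205F:  /* MEDIUM MATHEMATICAL SPACE */
    case 0x3000:  /* IDEOGRAPHIC SPACE */
        return 1;
    default:
        return 0;
    }
}
```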

~~~
userbinator
Compiling switches into binary (and even sometimes trinary) trees is an old
technique that's been around since at least the days of 16-bit DOS... they
will usually choose between a single-level lookup, double-level lookup, or
tree depending on sparsity and options used.

Subtraction and addition implicitly set the flags, so you can generate very
small code with inc/dec (1-byte instructions), like this:

    
    
        dec ax
        jz ax_was_zero
        dec ax
        jz ax_was_one
        ...

------
stelund
Maybe the asm instruction to scan memory for a byte is an alternative: repne
scasb.

It won't find all the characters in a single scan, but maybe do 3 passes over
a buffer which fits in cache.

------
aib
I wonder how much of a difference using a separate destination buffer would
make. Apart from some memory/cache/invalidation/magic thing that I'm not
certain might occur, it would allow one to use the extended instructions used
for scanning.

------
Annatar

      awk '
      {
        gsub(/[[:blank:]\015]/, "");
        printf("%s", $0);
      }' input | tee output
    

Stripping out LF isn't necessary because AWK does it on every record
automatically. For even more speed, the code can be translated into ANSI C and
compiled with awka[1] using an optimizing C compiler.

[1]
[http://awka.sourceforge.net/index.html](http://awka.sourceforge.net/index.html)

------
exDM69
I am surprised to see any speedup in this! I'd expect something trivial like
this to be completely memory bound with the CPU sitting almost idle waiting
for bytes coming in from memory.

Looking at the benchmark code, this is using rdtsc to read the CPU time stamp
counter. That does not take waiting for memory into account, does it?

I wonder if there's a difference when measured in wall clock time. It's still
somewhat beneficial to have the CPU work efficiently to give an opportunity
for hyperthreading to take place when waiting for memory.

If you really wanted to make something like this faster, you should focus on
cache utilization and make use of prefetching instructions. x86 has pretty bad
prefetching instructions and pretty good speculative fetching, so don't expect
massive speedups but on ARM or Aarch64, you have a finer grained control over
cache prefetching (L1 and L2 separately) and you could see _much_ bigger
differences.

As for benchmarking this kind of problem: you obviously want to measure real-
world performance, so you need wall clock time as well as the time stamp
counter, but I'd look for optimization clues in "perf stat" and other CPU perf
counters, with an emphasis on cache misses and branch mispredictions.

The figure you should be staring at is the total _throughput_ of the
algorithm, measured in gigabytes per second. You should be getting figures
close to the memory bandwidth available (25-50 GB/s depending on CPU and
memory).

edit: I measured the wall clock time with clock_gettime before/after all the
repeats (using a megabyte sized buffer) and there is indeed no significant
difference, here's my results:

    
    
        memcpy(tmpbuffer,buffer,N):  0.122945 cycles / ops 1495907352 nsec (1.495907 sec) 
        countspaces(buffer, N):  3.657322 cycles / ops 1544915395 nsec (1.544915 sec) 
        despace(buffer, N):  6.521193 cycles / ops 1621204460 nsec (1.621204 sec) 
        faster_despace(buffer, N):  1.721657 cycles / ops 1500507217 nsec (1.500507 sec) 
        despace64(buffer, N):  3.595031 cycles / ops 1544993649 nsec (1.544994 sec) 
        despace_to(buffer, N, tmpbuffer):  6.307885 cycles / ops 1615101563 nsec (1.615102 sec) 
        avx2_countspaces(buffer, N):  0.190992 cycles / ops 1460961459 nsec (1.460961 sec) 
        avx2_despace(buffer, N):  5.750583 cycles / ops 1615971010 nsec (1.615971 sec) 
        sse4_despace(buffer, N):  0.985002 cycles / ops 1482901389 nsec (1.482901 sec) 
        sse4_despace_branchless(buffer, N):  0.338737 cycles / ops 1460874704 nsec (1.460875 sec) 
        sse4_despace_trail(buffer, N):  1.950657 cycles / ops 1502268447 nsec (1.502268 sec) 
        sse42_despace_branchless(buffer, N):  0.562246 cycles / ops 1468638389 nsec (1.468638 sec) 
        sse42_despace_branchless_lookup(buffer, N):  0.624913 cycles / ops 1472445127 nsec (1.472445 sec) 
        sse42_despace_to(buffer, N,tmpbuffer):  1.747046 cycles / ops 1507705780 nsec (1.507706 sec)
    

Here's the diff to the original:
[http://pasteall.org/208511](http://pasteall.org/208511)

edit2: surprisingly, Clang is about 10% slower than GCC in my experiments.

~~~
nkurz
Thanks for looking into this! I know that Daniel appreciates feedback, but
rarely reads HN, so an email to him or blog comment with your results might be
helpful.

 _Looking at the benchmark code, this is using rdtsc to read the CPU time
stamp counter. That does not take waiting for memory into account, does it?_

On modern Intel processors, the "time stamp counter" is monotonically
increasing at a constant rate, so it does take memory latency into account. On
many Linux systems (including the one used here) clock_gettime() uses the same
underlying clock source, so there should be no difference in accuracy for long
measurements. The CPUID-RDTSC/RDTSCP-CPUID pattern used here has the
advantage of a somewhat lower and significantly more predictable overhead,
which helps when measuring shorter events.

 _I 'd expect something trivial like this to be completely memory bound with
the CPU sitting almost idle waiting for bytes coming in from memory._

If I remember the numbers right, the Skylake processor this is running on can
read about 64B per cycle if the source is L1, about 24B per cycle if the
source is L3, and about 6B per cycle from main memory. Using the existing
RDTSC framework, I get equal speeds at L3 size, and still get sub .4B/cycle
coming from main memory.

 _I measured the wall clock time with clock_gettime before /after all the
repeats (using a megabyte sized buffer) and there is indeed no significant
difference_

I agree that testing on larger buffers would be informative (as would testing
different ratios of whitespace) but I don't think your approach is capturing
what you think it is. The macro being used for time measurement uses a
different random input for each iteration (look at 'pre'), and the unoptimized
time of initializing this dominates the clock time. So while I think your test
is worthwhile, I think you need a better way to perform the measurement. I
think you'll see that RDTSC maps exactly to wall time in this case, but
surprises are definitely possible.

 _I 'd look for optimization clues in "perf stat" and other CPU perf counters,
with an emphasis on cache misses and branch mispredictions._

Yes, although as with wall time, one needs to be sure to measure only on the
section of code that one is optimizing. Perf makes this difficult, so
something like "likwid" with an API that allows profiling fragments of code
would be required. I haven't done that yet for this, but at the faster speeds
(sse4_despace_branchless) I don't think there are any cache misses or branch
mispredictions in the code of interest.

 _surprisingly, Clang is about 10% slower than GCC in my experiments_

My surprise was the opposite. Clang was 25% faster than GCC and ICC on the L1-
and L3-sized vectorized branchless (.25 cycles/byte versus .33 cycles/byte,
although the code I'm running is not quite what's on Github). Most of this
benefit seems to be because clang unrolls 2x to reduce loop overhead.

~~~
exDM69
I dropped a comment on his blog and a link to this discussion...

And how sloppy of me! I didn't notice the `pre;` in the code consuming most
of the time (to be fair, it's just 4 characters and I didn't put too much time
into it). When I move that outside of the loop, before the timer, I get
results that show improvement with the optimized version.

And indeed it looks like rdtsc is giving similar figures to clock_gettime now.
I falsely presumed that it counts retired instructions.

I'm still surprised to see a speedup, and how badly the original version is
performing.

My guess is that the conditional stores are poison to the CPU pipelines. The
64 bit version gets most of the performance out of it already, I presume it's
because of the more efficient memory usage pattern.

I can't edit my earlier post any more, I'd correct it if I could.

Clang:

    
    
        memcpy(tmpbuffer,buffer,N):  0.000000 cycles / ops 65232 nsec (0.000065 sec) 
        countspaces(buffer, N):  0.000000 cycles / ops 65176 nsec (0.000065 sec) 
        despace(buffer, N):  4.547286 cycles / ops 7654310198 nsec (7.654310 sec) 
        faster_despace(buffer, N):  1.582721 cycles / ops 2651677500 nsec (2.651678 sec) 
        despace64(buffer, N):  0.583952 cycles / ops 1025215835 nsec (1.025216 sec) 
        despace_to(buffer, N, tmpbuffer):  0.000000 cycles / ops 63847 nsec (0.000064 sec) 
        avx2_countspaces(buffer, N):  0.000000 cycles / ops 63697 nsec (0.000064 sec) 
        avx2_despace(buffer, N):  0.307253 cycles / ops 602528061 nsec (0.602528 sec) 
        sse4_despace(buffer, N):  0.310504 cycles / ops 534542967 nsec (0.534543 sec) 
        sse4_despace_branchless(buffer, N):  0.353851 cycles / ops 594652080 nsec (0.594652 sec) 
        sse4_despace_trail(buffer, N):  0.314221 cycles / ops 562439990 nsec (0.562440 sec) 
        sse42_despace_branchless(buffer, N):  0.608811 cycles / ops 1020576734 nsec (1.020577 sec) 
        sse42_despace_branchless_lookup(buffer, N):  0.608494 cycles / ops 1020062750 nsec (1.020063 sec) 
        sse42_despace_to(buffer, N,tmpbuffer):  1.779908 cycles / ops 2983955982 nsec (2.983956 sec) 
    

GCC:

    
    
        memcpy(tmpbuffer,buffer,N):  0.285702 cycles / ops 489599835 nsec (0.489600 sec) 
        countspaces(buffer, N):  0.000000 cycles / ops 63995 nsec (0.000064 sec) 
        despace(buffer, N):  4.751018 cycles / ops 8014809122 nsec (8.014809 sec) 
        faster_despace(buffer, N):  1.718575 cycles / ops 2898082416 nsec (2.898082 sec) 
        despace64(buffer, N):  0.883421 cycles / ops 1560479651 nsec (1.560480 sec) 
        despace_to(buffer, N, tmpbuffer):  6.424313 cycles / ops 10892716258 nsec (10.892716 sec) 
        avx2_countspaces(buffer, N):  0.031227 cycles / ops 52369789 nsec (0.052370 sec) 
        avx2_despace(buffer, N):  0.315633 cycles / ops 627793484 nsec (0.627793 sec) 
        sse4_despace(buffer, N):  0.319739 cycles / ops 554173689 nsec (0.554174 sec) 
        sse4_despace_branchless(buffer, N):  0.366240 cycles / ops 615638316 nsec (0.615638 sec) 
        sse4_despace_trail(buffer, N):  0.318973 cycles / ops 572262343 nsec (0.572262 sec) 
        sse42_despace_branchless(buffer, N):  0.638114 cycles / ops 1070772787 nsec (1.070773 sec) 
        sse42_despace_branchless_lookup(buffer, N):  0.637988 cycles / ops 1069642163 nsec (1.069642 sec) 
        sse42_despace_to(buffer, N,tmpbuffer):  1.768081 cycles / ops 2963488990 nsec (2.963489 sec)

------
chrismorgan
A variant of this problem yields further interesting possibilities: if you’re
trying to remove control characters like CR and LF, but replacing them with
whitespace would be acceptable. That way you can work on it in-place, without
needing to copy memory or allocate or anything like that.

------
saretired
The “optimized” functions have a bug when the number of remaining bytes is
less than the block size.

~~~
nkurz
Certainly possible, but are you looking at the excerpts in his blog post, or
the actual code that he links to?
[https://github.com/lemire/despacer/blob/master/include/despa...](https://github.com/lemire/despacer/blob/master/include/despacer.h)

~~~
saretired
I'm looking at the excerpts in his post, which are broken.

------
Too
Someone file a compiler bug? A 14x difference between the readable code and
the optimized code is a lot. The first version is extremely straightforward;
you shouldn't have to deal with that SIMD mess manually.

~~~
Sharlin
Failure to optimize is never a bug, by definition. Automatic vectorization of
sequential code is a very difficult problem in general. Especially when your
code still depends on testing single bytes in the input and vectorizing _that_
is only possible using some very clever bit twiddling.

------
fisherjeff
Adding lookup tables to the naïve implementation gave me a ~3x speedup with
virtually no extra effort. Branch penalties are a real killer.

~~~
twoodfin
You have to be careful with lookup tables. They can look impressive when
microbenchmarking a single loop, but in the aggregate they can hurt
performance by crowding other useful data out of limited cache space.

------
ascotan
I guess you could write CUDA code to do this on a GPU, but then the question
becomes why? :/

~~~
antonmks
Ok, here it is :
[https://gist.github.com/antonmks/17e0c711d41fb07d1b6f3ada3f5...](https://gist.github.com/antonmks/17e0c711d41fb07d1b6f3ada3f5f29ee)

0.25 cycles/byte using GTX 1080.

------
Darge
Does anyone know how to do such a benchmark?

~~~
sp332
Check out the linked Github repo.

------
ebbv
Optimizing by hand is rarely the fastest thing to do nowadays. I wonder how
the original naive approach fares with all optimizations turned on in gcc and
with realistic input?

~~~
Too
The optimization flags used should be good enough.

[https://github.com/lemire/despacer/blob/master/Makefile](https://github.com/lemire/despacer/blob/master/Makefile)

CFLAGS = -fPIC -std=c99 -O3 -march=native

------
cowardlydragon
ASCII / 1-byte characters? I stopped reading then.

UNICODE or GTFO.

Should be embarrassingly parallel for a contiguous array.

~~~
foxhill
this is in no way embarrassingly parallel! in fact, it's entirely non-trivial
to parallelise; writing the output is an inherently serial operation.

~~~
mrkgnao
Could you not do something like chunk the input and operate on the pieces in
parallel? I'd think you could use something like a rope (à la text editors) to
support the concurrent writes.

(I may or may not have forgotten the differences between concurrency and
parallelism.)

~~~
foxhill
yeah, you could split the input into chunks, but then you're left with the
problem of recombining the outputs from each portion into one string - that
operation must be serial and would likely outweigh the benefits of the initial
parallelism in most cases.
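
To make that trade-off concrete, here's a sketch of the two-phase structure
(my illustration: a sequential stand-in for the parallel phase, with an
assumed cap of 64 chunks to keep the sketch simple):

```c
#include <stddef.h>
#include <string.h>

/* Phase 1 (parallelisable): despace each chunk in place, record lengths.
 * Phase 2 (inherently serial): memmove the compacted chunks together. */
static size_t despace_chunk(char *s, size_t n)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] != ' ' && s[i] != '\n' && s[i] != '\r')
            s[pos++] = s[i];
    return pos;
}

size_t despace_chunked(char *s, size_t n, size_t chunk)
{
    size_t lens[64], nchunks = 0;  /* assumes n/chunk <= 64 */
    if (n == 0)
        return 0;
    for (size_t off = 0; off < n; off += chunk)       /* phase 1 */
        lens[nchunks++] = despace_chunk(s + off,
                                        n - off < chunk ? n - off : chunk);
    size_t out = lens[0];
    for (size_t i = 1; i < nchunks; i++) {            /* phase 2 */
        memmove(s + out, s + i * chunk, lens[i]);
        out += lens[i];
    }
    return out;
}
```

The memmove pass touches nearly every output byte again, which is why the
serial stitch-up can eat the parallel gains unless the chunks are large.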

------
johnnyb9
Going to start using this as an interview question!

~~~
webo
Please don't. Especially for a typical software engineering position.

------
austincheney
Isn't this the very kind of thing Regular Expressions were created to solve?

* Remove spaces - myString.replace(/\u0020+/g, "");

* Remove common line terminators - myString.replace(/(\r|\n)+/g, "");

* Remove all white space - myString.replace(/\s+/g, "");

~~~
SonicSoul
the post is not so much about the most convenient way to remove spaces, but
about the efficiency of such an operation.

~~~
austincheney
Is regex execution considered efficient? I have always been impressed with its
speed.

~~~
Sharlin
Even if the regex library compiled the expression down to the very simplest
state machine possible, it would still be _at most_ as efficient as the first
handwritten code in the article, handling one byte at a time. I'm fairly sure
no regex engine in existence is able to detect "simple enough" regexes and
optimize them into code that handles 64 or 128 bit blocks at a time.

~~~
burntsushi
They exist. In fact, if you hand Rust's regex engine the pattern ` |\r|\n`,
then it will compile it down to a very simple alternation of three literal
bytes. It will never touch the actual regex engine at all. Once the regex
engine knows it's a simple alternation of three bytes, it will actually use a
variant of memchr called memchr3[1], which looks almost exactly like the OP's
"portable" approach.

So it's not quite the fastest possible because it doesn't use SIMD, but this
is mostly because we don't have good explicit SIMD support on stable Rust yet.
Once we do, this immediately becomes a strong candidate to optimize to
something like the OP's fastest version (and also probably generalizing to
AVX2 instructions as well).

N.B. My sibling comment linked to my blog post on ripgrep, which uses Rust's
regex engine. ;-)

[1] - [https://github.com/BurntSushi/rust-
memchr/blob/master/src/li...](https://github.com/BurntSushi/rust-
memchr/blob/master/src/lib.rs#L347)
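
The contract of memchr3, stripped of the word-at-a-time tricks the real
rust-memchr crate uses, is just this (a hedged scalar sketch in C for
illustration):

```c
#include <stddef.h>

/* First position of any of three needle bytes, or NULL: the fallback
 * path of a memchr3-style search before SIMD/word tricks kick in. */
const char *memchr3(const char *s, size_t n, char a, char b, char c)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] == a || s[i] == b || s[i] == c)
            return s + i;
    return NULL;
}
```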

~~~
Sharlin
Awesome, thank you!

------
hcrisp
Not claiming to be the fastest, but here are two Python solutions suggested by
Dave Beazley [0], plus one I extended using compiled regular expressions:

    
    
        # string replacement
        s = ' hello  world \n'
        s.replace(' ', '')
        # 'helloworld\n' (only removes spaces, not the newline)
    
        # regular expression
        import re
        re.sub(r'\s+', '', s)
        # 'helloworld'
    
        # compiled regular expression
        pat = re.compile(r'\s+')
        pat.sub('', s)
        # 'helloworld'
    

[0]
[https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s11.html](https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s11.html)

~~~
niccl
my first python thought was

    
    
      s = ' hello  world \n'
      s = ''.join(s.split())
    

again probably not fast, but simple

~~~
justinhj
Scala: val nospaces = "a string with spaces" filter { _ != ' ' }

