
Beating C with Dyalog APL - lelf
https://ummaycoc.github.io/wc.apl/
======
mlochbaum
Dyalog implementor here. The hot loops in this function are running my code!

I'm not surprised at all about this result, although I certainly wouldn't use
it to make a pronouncement about Dyalog or C as a whole. But there are some
places where interpreted array languages have a major advantage over typical
compiled languages.

One of the advantages seen in this wc function is our use of bit booleans
rather than byte booleans. Packing 8 bits to a byte uses one eighth the
space, and can lead to drastic speed improvements: 2-8 times faster than even
code which uses short ints.

On that note, the function

    words←{(~(¯1)↑⍵)++/1=(1↓⍵)-(¯1)↓⍵}

can be improved by keeping the data boolean. 1=(1↓⍵)-(¯1)↓⍵ is equivalent to
the windowed reduction 2</⍵ which identifies places where the boolean argument
increased in value. We get:

    words←{(~¯1↑⍵)++/2</⍵}

Since this function doesn't produce a 1-byte int result from subtraction, it's
many times faster. I discuss some similar functions to 2</ in a blog post:
[https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/](https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/).
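For anyone who doesn't read APL, the transition-counting idea can be sketched in scalar C (illustrative only; the APL version operates on packed bit booleans, which is where the big speedup comes from):

```c
#include <ctype.h>
#include <stddef.h>

/* Count words by counting 0->1 transitions of the "inside a word"
   boolean -- the same idea as the windowed reduction 2</ above.
   Scalar sketch, not Dyalog's actual implementation. */
static size_t count_words(const char *s)
{
    size_t words = 0;
    int prev = 0;                        /* was the previous char in a word? */
    for (; *s; s++) {
        int cur = !isspace((unsigned char)*s);
        words += (size_t)(cur & !prev);  /* 1 exactly at a word start */
        prev = cur;
    }
    return words;
}
```

This counts word starts (prev initialised to 0 handles a word at the very beginning), mirroring how the ~¯1↑⍵ term accounts for a word running to the end of the input.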

Another big improvement is coming next year in Dyalog 18.0. The computations
data∊nl and data∊sp will use vectorised search methods similar to Intel's
Hyperscan
([https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)).
The character lookups here become lookups into a 256-bit table held in two
SSE registers, searched using branchless SSSE3 instructions. I didn't go
through this specific algorithm, but I explained many of our new search
techniques at last year's user meeting:
[https://www.youtube.com/watch?v=paxIkKBzqBU](https://www.youtube.com/watch?v=paxIkKBzqBU).
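The table-lookup idea is easy to show in scalar C. This hypothetical sketch classifies bytes against a 256-bit membership table, one bit per byte value; the SSE version performs the equivalent lookup on 16 bytes at once:

```c
#include <stdint.h>
#include <stddef.h>

/* A 256-bit set: one membership bit for each possible byte value.
   Scalar sketch of the table-lookup idea, not Dyalog's SSSE3 code. */
typedef struct { uint64_t bits[4]; } byteset;

static void byteset_add(byteset *s, unsigned char c)
{
    s->bits[c >> 6] |= (uint64_t)1 << (c & 63);
}

/* Branchless membership count, a scalar analogue of data∊nl. */
static size_t count_members(const byteset *s, const char *data, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)data[i];
        count += (size_t)((s->bits[c >> 6] >> (c & 63)) & 1);
    }
    return count;
}
```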

By my measurements, each improvement (changing the word boundary computation
and switching to the unreleased Dyalog 18.0) knocks off about a third of the
total time. With both, this code is three times faster than it was before!

~~~
jodrellblank
I really liked your sub-nanosecond searching talk, for my taste it felt like a
great blend of explaining exactly what you do at the low levels, without
dragging me through the weeds of _exactly_ what you do at the low levels. A
great piece of optimization.

On the subject of Dyalog APL performance, one of your other talks about a
proposal for thunking / lazy execution, IIRC there was a slide of "all high
level patterns the interpreter recognises and special-cases", and I was
surprised at how few there are: about a dozen or so.

Given how often people voice that "there's potential for an APL interpreter to
recognise this slow prime number generator and special-case it", I assumed a
lot of the work of speeding up an APL interpreter would be years of building
up a vast array of special case pattern handlers, and that doesn't seem to be
the case. Is it much harder than it seems? Do the same code patterns not come
up often enough to bother with?

~~~
mlochbaum
Just having a well-chosen and fast set of primitives goes a long way. If you
write a three-primitive combination and the interpreter doesn't recognise it
but all three primitives are fast, how bad is it, really? You're losing a
factor of three at worst (which is still in faster-than-C territory much of
the time), and probably more like 1.5 or 2 since the special combination would
be more complicated. Sometimes you actually gain by splitting an algorithm
into multiple passes: I remember an instance where I hand-wrote some nice
branchless AVX2 code to find the index of the minimum of a numeric vector
(it's (⊃⍋) but don't expect that to be fast yet). Then I wrote a better
vectorised minimum and tried out (x⍳⌊/x), which just gets the overall minimum
and searches the vector for it. Worst case that algorithm was 25% slower. In
the best case, when the minimum was near the end and it could stop early, it
was twice as fast!
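In scalar C the two approaches look like this (a hypothetical sketch; the point is that both passes of the second version are simple enough to vectorise, and the search pass can stop at the first hit):

```c
#include <stddef.h>

/* Single-pass index-of-minimum: one loop, but the running (min, index)
   dependence makes it awkward to vectorise well. Requires n > 0. */
static size_t argmin_one_pass(const int *x, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (x[i] < x[best]) best = i;
    return best;
}

/* Two-pass version, the (x⍳⌊/x) idea: a plain minimum reduction,
   then a search for its first occurrence. Requires n > 0. */
static size_t argmin_two_pass(const int *x, size_t n)
{
    int m = x[0];
    for (size_t i = 1; i < n; i++)       /* pass 1: overall minimum */
        if (x[i] < m) m = x[i];
    for (size_t i = 0; ; i++)            /* pass 2: first occurrence; */
        if (x[i] == m) return i;         /* terminates, since m is in x */
}
```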

That said, we still do a lot of work on recognising patterns. The idioms from
my presentations are patterns recognised as a sequence of tokens during
parsing, but we can also recognise particular derived functions (the results
of operators, or function trains). An obvious example is the sum +/, and
there are some pretty complicated ones involving Rank or Key. There are
probably about a hundred cases like this in total, although many of them
handle several different combinations. Much of what I do is not to try to
identify more special cases to handle with a custom algorithm but to make the
algorithms more general and to use them in more places. I tend to develop
engines (say, a column permuter using vector shuffles) and then write a bunch
of code to recognise when I can use the engines for particular cases within a
primitive or combination (indexing, reverse, rotate, take/drop, and transpose
on trailing axes).

The most important work is on short patterns because the longer ones just
don't show up often enough, or they show up in too many different
permutations. Thunks are a way to recognise short patterns flexibly:
recognition doesn't depend on the way the pattern is written, just on the
functions used.
They also offer flexibility in that different operations can sometimes emit
the same thunk with a parameter, allowing other functions to just handle that
type of thunk as a whole. We've run into some trouble with our internal
architecture that's holding thunks up (should be cleared up by the 18.0
release so we can start implementing them for 19.0), but I think recognising
special combinations will be very different, and much easier, once they're
working. And I can finally get the aforementioned (⊃⍋) running fast. And 1↑⍋.
And ⊣/⍋...

~~~
jodrellblank
Thinking on this for a while, it makes more sense to spend optimization effort
on a few widely useful engines which will improve many people's code in many
situations, than on hundreds of narrow improvements which might improve some
code in some situations and will add a lot more integration risk and
maintenance overhead.

Turning +/iota into a constant-time formula feels like a great idea that APL
is in a position to benefit from, and mathematicians could see endless places
where that kind of thing might be possible - but if that's true, APL users
could find them when they need them. They can't find vectorised pick of grade
up, ever.
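(For concreteness, the +/iota case: with the default ⎕IO←1, +/⍳n is just the closed form n(n+1)/2, so an interpreter that spots the pattern never materialises the vector. A trivial C sketch:)

```c
/* Closed form for +/⍳n, i.e. 1+2+...+n: no vector is ever built. */
static unsigned long long sum_iota(unsigned long long n)
{
    return n * (n + 1) / 2;
}
```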

This is so interesting, thank you.

------
imglorp
Perhaps not a fair comparison until you account for the C library's string
comparison behaviour. If your default is LANG=C.UTF-8, your wc may be a bunch
slower than with LANG=C. I think OSX is still using GNU wc, yes? Maybe the
Dyalog race should be repeated with this controlled for.

[https://old.reddit.com/r/programming/comments/1sxpgp/make_gr...](https://old.reddit.com/r/programming/comments/1sxpgp/make_grep_50x_faster/)

Edit. Still true in Linux today.

    $ LANG=C time -p wc /tmp/foo 2>&1 >/dev/null
    real 0.29
    user 0.28
    sys 0.01

    $ LANG=C.UTF-8 time -p wc /tmp/foo 2>&1 >/dev/null
    real 0.86
    user 0.86
    sys 0.00

~~~
saagarjha
It’s a FreeBSD derivative:
[https://opensource.apple.com/source/text_cmds/text_cmds-99/w...](https://opensource.apple.com/source/text_cmds/text_cmds-99/wc/wc.c.auto.html)

~~~
imglorp
Looks like they have the same problem. They're using LC_CTYPE to hint
mbrtowc(). In fact, if you web search "mbrtowc slow" you'll get a bunch of
hits about "Why is wc so slow"!! :-)

------
eggy
I'm not heavily invested in speed benchmarks, but I am a big fan of APL/J/K,
and I am always glad to see how others write code with them. I like that I can
handwrite a J or APL solution, while thinking it through before getting to my
phone or computer to actually test it. It's my version of doing a crossword
puzzle or a math problem. I still love C too, since it was my third language
after Basic on my Commodore PET and machine-language assembly on my VIC-20.
The Programming the VIC book cost me $24.95 back in 1984.

------
cousin_it
In the HN thread for the original article, geocar posted one line of q that
does the job as well:
[https://news.ycombinator.com/item?id=21267923](https://news.ycombinator.com/item?id=21267923)

~~~
geocar
Beating benchmarks is always fun, but I think the ergonomics of the solution
matter a great deal.

That is to say the words we use matter: I'm excited that Haskell, and OCaml
(and so on) have efficient solutions, but I'm extremely disappointed that the
best implementations look nothing like the "obvious" approach.

Maybe that's to be expected; after all, an "obvious" implementation in C that
pumps getchar() all day will be pretty slow, and experienced C programmers
will do their own buffering and threading to win -- and _that_ doesn't look
like the "obvious" one either.

And yet, in k/q the "obvious" implementation is the fast one. That's cool, and
way more cool than you might realise as long as you think coding is hard.

------
tom_mellior
> Just like Chris Penner’s original article, I’m comparing against the OSX
> version of wc that shipped with my machine. Just like in the original
> article, I admit that there are likely faster versions of wc–I’m just
> comparing what I got.

Could someone on OSX post a benchmark of their system wc vs. one compiled
manually from source using gcc or clang with -O3?

Last time around I did this for Ubuntu and found a 2x difference:
[https://news.ycombinator.com/item?id=21271951](https://news.ycombinator.com/item?id=21271951)

So if this article ends up beating their system wc by 2x, that might not say
anything about "beating C".

------
person_of_color
Wow this is a bizarre esolang.

~~~
empath75
It’s not an esolang. It has a long history of actual real production use.

~~~
pjc50
Production use these days is more likely to be J, which you can actually type
on a normal keyboard.

~~~
Jtsummers
It is different to look at, but the editing modes let you type the APL
characters without too much difficulty. I spent about three months doing a
deep dive into APL last year; it took about two weeks to become proficient at
typing it (I was already a touch typist, though). I still remember most of
the character positions (the worst part for me was that I use Dvorak, so the
visualized keymaps are useless since most assume a QWERTY layout). I mostly
used Emacs; I think the default prefix character was `, but I switched it to
. (so typing iota is .i and typing rho is .r).

------
jheriko
But being faster than C is actually trivial and obvious, even in the general
case...

Here is something I crapped out 6 years ago to prove the point:
[https://github.com/semiessessi/CP1](https://github.com/semiessessi/CP1)

~~~
tom_mellior
--verbose, please, before we breathlessly run out to install Visual Studio so
we can run code we know nothing about?

------
breadandcrumbel
Pretty sure the words function can be simplified to {⍴(~⍵∊⍺)⊆⍵} (and it also
handles empty strings).

~~~
olzd
How nice of you to copy/paste my reddit comment!

~~~
coldtea
The relevant comment the parent mentions:
[https://www.reddit.com/r/programming/comments/dku13a/beating...](https://www.reddit.com/r/programming/comments/dku13a/beating_c_with_dyalog_apl_wc/)

------
ncmncm
C is a low bar. It has always been slower than Fortran. It has been slower
than C++ ever since reasonable optimization was implemented. Anything _not_
faster than C, today, is a slow language, practically by definition.

~~~
kabdib
> It has been slower than C++ ever since reasonable optimization was
> implemented.

Which compilers? Which platforms? Optimized for speed or space (because
"faster" is only one dimension of optimization)?

Many C / C++ compiler implementations share the same front ends, back ends and
runtime libraries, so it's hard to see how the code generation for C would be
much different than that of C++. (In fact, C++ will be harder to optimize if
features like exceptions and RTTI are used).

Given the many different commercially and freely available toolchains, this
statement is difficult to back up.

~~~
Someone
The one area where C++ can be a lot faster than C is in places where C uses a
function pointer where C++ uses a function template.

The standard example is sorting. Say you're sorting an array of integers.
Then C's _qsort_ has to (1) call the comparison function passed as an
argument, whereas C++'s _std::sort_ can inline the comparisons down to a
single instruction.

(1) if the source of the function passed in is visible from the compilation
unit, I think the C standard allows the compiler to compile a specialized
version of sort in the same way C++ can, but I’m not aware of any compiler
that does that
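The difference is easy to see in a C sketch: qsort pays an indirect call per comparison, while a sort whose element type is known at compile time (here a toy insertion sort standing in for a std::sort instantiation) lets the compiler inline the compare:

```c
#include <stdlib.h>
#include <stddef.h>

/* qsort-style: every comparison goes through a function pointer, which
   the compiler generally cannot inline across the library call. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Type-specialized sort (toy insertion sort): the comparison a[j-1] > v
   compiles down to a single integer compare, with no call overhead. */
static void sort_ints(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int v = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
        a[j] = v;
    }
}
```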

~~~
kabdib
Right, inlines are hard to beat, and they are not part of the C standard
(which is kind of mind boggling, I mean . . . honestly, what year is it?).

Still, the benefit of inlines diminishes as the size of the inlined functions
increases (you start paying penalties for extra cache lines full of code, and
the percentage of time in function call overhead gets small in a hurry).

The linker can get into the inlining game, too, by the way, if the win is
sufficiently big. I have scars to prove it. :-)

~~~
colonwqbang
What do you mean by "inlines are not part of the C standard"?

The standard talks about visible behaviour of programs, never the concrete
implementation.

~~~
kabdib
You're correct about C99. I cut my teeth on K&R; C99 has inline declarations.

------
JamesT_Kirk
1st comment. Ever. Yes, here in the 23rd century APL has returned to its
righteous glory and C is just a few bytes of late-20th-century IT (to use
your language) history.

