Beating C with Dyalog APL (ummaycoc.github.io)
102 points by lelf on Oct 24, 2019 | hide | past | favorite | 56 comments

Dyalog implementor here. The hot loops in this function are running my code!

I'm not surprised at all about this result, although I certainly wouldn't use it to make a pronouncement about Dyalog or C as a whole. But there are some places where interpreted array languages have a major advantage over typical compiled languages.

One of the advantages seen in this wc function is our use of bit booleans rather than byte booleans. Packing 8 bits in a byte uses eight times less space, and can lead to drastic speed improvements: 2-8 times faster than even code which uses short ints.

On that note, the function

can be improved by keeping the data boolean. 1=(1↓⍵)-(¯1)↓⍵ is equivalent to the windowed reduction 2</⍵ which identifies places where the boolean argument increased in value. We get:

Since this function doesn't produce a 1-byte int result from subtraction, it's many times faster. I discuss some similar functions to 2</ in a blog post at https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrink....

Another big improvement is coming next year in Dyalog 18.0. The computations data∊nl and data∊sp will use vectorised search methods similar to Intel's Hyperscan (https://github.com/intel/hyperscan). The character lookups here are changed to lookups into a 256-bit table, or two SSE registers, searched using branchless SSSE3 instructions. I didn't go through this specific algorithm, but I explained many of our new search techniques at last year's user meeting: https://www.youtube.com/watch?v=paxIkKBzqBU.

By my measurements, each improvement (changing the word-boundary computation, and switching to the unreleased Dyalog 18.0) knocks off about a third of the total time. With both, this code is three times faster than it was before!

Some timings to show just how quickly Dyalog is improving, taken with the original function from OP and big.txt (https://github.com/ChrisPenner/wc/blob/master/data/big.txt) on a Kaby Lake i7. Dyalog releases about once a year, and 17.1 had almost no performance changes.

  15.0: 93ms
  16.0: 82ms
  17.0:  7.4ms
  17.1:  7.5ms
  18.0:  5.3ms
There's still more room for improvement: although arithmetic functions use AVX2 when available, functions based on SSSE3 shuffle don't yet. That includes Membership (∊) here, which takes the majority of the runtime in my improved version of wc and should be twice as fast with AVX2.

I really liked your sub-nanosecond searching talk; for my taste it was a great blend of explaining exactly what you do at the low levels without dragging me through the weeds. A great piece of optimization.

On the subject of Dyalog APL performance: in one of your other talks, about a proposal for thunking / lazy execution, IIRC there was a slide of "all the high-level patterns the interpreter recognises and special-cases", and I was surprised by how few there are. About a dozen or so.

Given how often people voice that "there's potential for an APL interpreter to recognise this slow prime number generator and special-case it", I assumed a lot of the work of speeding up an APL interpreter would be years of building up a vast array of special case pattern handlers, and that doesn't seem to be the case. Is it much harder than it seems? Do the same code patterns not come up often enough to bother with?

Just having a well-chosen and fast set of primitives goes a long way. If you write a three-primitive combination and the interpreter doesn't recognise it but all three primitives are fast, how bad is it, really? You're losing a factor of three at worst (which is still in faster-than-C territory much of the time), and probably more like 1.5 or 2 since the special combination would be more complicated. Sometimes you actually gain by splitting an algorithm into multiple passes: I remember an instance where I hand-wrote some nice branchless AVX2 code to find the index of the minimum of a numeric vector (it's (⊃⍋) but don't expect that to be fast yet). Then I wrote a better vectorised minimum and tried out (x⍳⌊/x), which just gets the overall minimum and searches the vector for it. Worst case that algorithm was 25% slower. In the best case, when the minimum was near the end and it could stop early, it was twice as fast!

That said, we still do a lot of work on recognising patterns. The list of idioms from my presentations are patterns that are recognised as a sequence of tokens in parsing, but we can also recognise particular derived functions (the results of operators, or function trains). An obvious example is the sum +/ and there are some pretty complicated ones involving Rank or Key. There are probably about a hundred cases like this in total although many of these cases handle several different combinations. Much of what I do is not to try to identify more special cases to handle with a custom algorithm but to make the algorithms more general and to use them in more places. I tend to develop engines (say, a column permuter using vector shuffles) and then write a bunch of code to recognise when I can use the engines for particular cases within a primitive or combination (indexing, reverse, rotate, take/drop, and transpose on trailing axes).

The most important work is on short patterns because the longer ones just don't show up often enough, or they show up in too many different permutations. Thunks are a way to recognise short patterns flexibly: it doesn't depend on the way the pattern is written, just the functions used. They also offer flexibility in that different operations can sometimes emit the same thunk with a parameter, allowing other functions to just handle that type of thunk as a whole. We've run into some trouble with our internal architecture that's holding thunks up (should be cleared up by the 18.0 release so we can start implementing them for 19.0), but I think recognising special combinations will be very different, and much easier, once they're working. And I can finally get the aforementioned (⊃⍋) running fast. And 1↑⍋. And ⊣/⍋...

Thinking on this for a while, it makes more sense to spend optimization effort on a few widely useful engines which will improve many people's code in many situations, than on hundreds of narrow improvements which might improve some code in some situations and will add a lot more integration risk and maintenance overhead.

Turning +/⍳ into a constant-time formula feels like a great idea that APL is in a position to benefit from, and mathematicians could see endless places where that kind of thing might be possible. But if that's true, APL users can find those formulas themselves when they need them. They can't ever get a vectorised pick of grade up on their own.

This is so interesting, thank you.

Awesome. Would you mind if I add a version like this to the article and acknowledge you?

Also, will anyone from Dyalog be at Code Mesh?

> One of the advantages seen in this wc function is our use of bit booleans rather than byte booleans. Packing 8 bits in a byte uses eight times less space, and can lead to drastic speed improvements: 2-8 times faster than even code which uses short ints.

I once rewrote a structured verilog compiler (when I was the sole full time employee at ChipScan.us); going from using bytes to store boolean values to packing the entire circuit's wire state into bits in a char* definitely helped a lot. Also marking that char* as restrict helped a lot, too :-)

Big wins are made of many small victories.

Go ahead. Incidentally boolean tricks like the windowed reduction are fairly common knowledge in the APL community, even though they're considered esoteric outside of it. It might be helpful to know that every function with boolean arguments and result which depends on both arguments has its own primitive (one of ∧∨⍲⍱<≤=≥>≠). Once you know you're transforming pairs of booleans to booleans, you just have to figure out which one it is.

We're not sending anyone to Code Mesh. I hadn't heard of the conference, but given how close it is maybe we should apply in the future.

Do you have an explanation handy for the windowed reductions? Specifically, what does the number on the left mean in this case (or in general)? I've tried it and it looks like `(n+1)</v` is basically `2</(n</v)` for `n>1`, but I also see some things that don't exactly fit that pattern.

The left argument is a window length. Windowed reduction groups the argument into overlapping windows of that length and does a reduction on each one. So if y has row length n, then (n f/ y) is the same as (f/ y) for a function f, except that the rank isn't reduced by one (there's an extra 1 at the end of the shape). For a numeric vector you can use ,/ as a windowed reduction to see the subvectors that are reduced over, but for a nested vector that trick doesn't quite work because it messes up the nesting.

You can email me with my first name at the company site to talk more. There's also a chat room at https://chat.stackexchange.com/rooms/52405/the-apl-orchard which is good for asking questions.

I definitely heard at least one mention of Dyalog APL there last year (not in a conversation I was in, I was literally walking past some others who were chatting).

I hope to see some of you there sometime in the future.

Perhaps not a fair comparison until you account for the C library's string-handling behavior. If your default is LANG=C.UTF-8, your wc may be a bunch slower than with LANG=C. I think OSX is still using GNU wc, yes? Maybe the Dyalog race should be repeated with this controlled for.


Edit: still true on Linux today.

    $ LANG=C time -p wc /tmp/foo 2>&1 >/dev/null
    real 0.29
    user 0.28
    sys 0.01

    $ LANG=C.UTF-8 time -p wc /tmp/foo 2>&1 >/dev/null
    real 0.86
    user 0.86
    sys 0.00

Looks like they have the same problem. They're using LC_CTYPE to hint mbrtowc(). In fact, if you web search "mbrtowc slow" you'll get a bunch of hits about "Why is wc so slow"!! :-)

I'm not heavily invested in speed benchmarks, but I am a big fan of APL/J/K, and I am always glad to see how others write code with them. I like that I can handwrite a J or APL solution, while thinking it through before getting to my phone or computer to actually test it. It's my version of doing a crossword puzzle or a math problem. I still love C too, since it was my third language after Basic on my Commodore PET, and machine language assembly on my VIC-20. The Programming the VIC book in 1984 cost me $24.95 at the time.

In the HN thread for the original article, geocar posted one line of q that does the job as well: https://news.ycombinator.com/item?id=21267923

Beating benchmarks is always fun, but I think the ergonomics of the solution matter a great deal.

That is to say the words we use matter: I'm excited that Haskell, and OCaml (and so on) have efficient solutions, but I'm extremely disappointed that the best implementations look nothing like the "obvious" approach.

Maybe that's to be expected, after all an "obvious" implementation in C that pumps getchar() all day will be pretty slow, and experienced C programmers will do their own buffering and threading to win -- and that doesn't look like the "obvious" one either.

And yet, in k/q the "obvious" implementation is the fast one. That's cool, and way more cool than you might realise as long as you think coding is hard.

I think this shows off far more power than the Haskell version: a rather straightforward implementation that is faster than the original, without the optimization effort the author went through for the Haskell version.

Readability-wise this suffers (not everyone will recognize the cr lf space tab sequence, and without that it would be really hard to understand, I would think), but I don't find it much more unreadable than the faster Haskell versions (you need a lot of prior knowledge to fluently read those as well). At least the q version is built from primitives, while Haskell needs quite a lot of libraries to perform at all. I like Haskell (and more static typing in general), but I also like having one short manual and being able to implement everything I want without pulling in many (unknown, potentially painful) libraries, and without writing tens of thousands of lines of code either.

> Just like Chris Penner’s original article, I’m comparing against the OSX version of wc that shipped with my machine. Just like in the original article, I admit that there are likely faster versions of wc–I’m just comparing what I got.

Could someone on OSX post a benchmark of their system wc vs. one compiled manually from source using gcc or clang with -O3?

Last time around I did this for Ubuntu and found a 2x difference: https://news.ycombinator.com/item?id=21271951

So if this article ends up beating their system wc by 2x, that might not say anything about "beating C".

Wow this is a bizarre esolang.

It’s not an esolang. It has a long history of actual real production use.

Didn't Iverson win the Turing Award for APL?

Production use these days is more likely to be J, which you can actually type on a normal keyboard.

It is different to look at, but the editing modes let you type the APL characters without too much difficulty. I spent about 3 months doing a deep dive into APL last year; it took about two weeks to become proficient typing it (I was already a touch typist, though). I still remember most of the character positions (the worst part for me was that I use Dvorak, so the visualized keymap is useless since most assume QWERTY layouts). I mostly used emacs; I think the default prefix character was `, but I switched it to `.`. Typing iota: .i. Typing rho: .r.

that doesn't stop it being an esolang

... although the code quality of the examples makes it look a million times more like one than it needs to

>that doesn't stop it being an esolang

Actually it does:

"An esoteric programming language (sometimes shortened to esolang) is a programming language designed to test the boundaries of computer programming language design, as a proof of concept, as software art, as a hacking interface to another language (particularly functional programming or procedural programming languages), or as a joke"

Right, and APL could not be further from the definition given that it started as a chalkboard notation for reasoning about array operations!

Right, and Lisp started as a notation in a paper with no intention of building a language. But nobody would call it an esolanguage.

How something started and what it became are orthogonal...

I'm pretty sure Jack the Ripper started as a cute baby too...

> Lisp started as a notation in a paper with no intention of building a language


Well, it has some element of truth. The S-expression notation we think of as "Lisp" originated as a contribution to recursive function theory rather than as a practical programming language, even if McCarthy had earlier been thinking of proceeding by adding IPL-V facilities to Fortran.

McCarthy wanted a list processing language. He added list processing functions (which then later appeared in Lisp) to Fortran. He and his team then went on to develop and implement a new language: Lisp. The implementation ALWAYS was based on lists and s-expressions.

The paper http://www-formal.stanford.edu/jmc/recursive.html was published, when Lisp was already implemented, to present some ideas to a different audience. But Lisp already existed by then as a running programming language.

First and foremost Lisp was a real programming language implemented on a real computer.


See also: History of Lisp, John McCarthy, http://jmc.stanford.edu/articles/lisp/lisp.pdf

"My desire for an algebraic list processing language for artificial intelligence work on the IBM 704 computer arose in the summer of 1956..."

"The implementation of LISP began in Fall 1958 ... These included programs to read and print list structure. I can’t now remember whether the decision to use parenthesized list notation as the external form of LISP data was made then or whether it had already been used in discussing the paper differentiation program."

"The M-notation also used brackets instead of parentheses to enclose the arguments of functions in order to reserve parentheses for list-structure constants. It was intended to compile from some approximation to the M-notation, but the M-notation was never fully defined, because representing LISP functions by LISP lists became the dominant programming language when the interpreter later became available."

"Anyway, I decided to write a paper describing LISP both as a programming language and as a formalism for doing recursive function theory. The paper was Recursive functions of symbolic ex- pressions and their computation by machine, part I (McCarthy 1960)."

An amusing coda to the M-notation story is that many extant examples of it closely resemble or are valid K/Q syntax.

Right, that's exactly what I mean. Thank you for adding all these great references to primary sources! Has CBI or somebody done an oral-history interview with Slug Russell yet covering this time period? He's not dead yet. There's still time.

Just that the s-expression notation did not originate as a contribution to recursive function theory, but actually originated from designing and implementing a programming language and system for list processing.

I think your reading of these sources differs from mine, then, but the important thing is that now anyone who is interested can peruse them at their leisure and make up their own mind.

Yes. Originally it was a thought experiment. Just like mathematicians can imagine an idea and use a notation without creating a compiler.

Sure not. McCarthy wanted to develop a list processing system for the IBM 704 in the mid 50s. He experimented with Fortran and added list processing primitives and then went on to design and implement a new language: Lisp

McCarthy early on proposed to write a compiler - actually this was one of the very very early papers:

J. McCarthy. Memo to P. M. Morse: A Proposal for a compiler. Memo CC-56, Computation Center, Massachusetts Institute of Technology, December 13, 1957, 19 pages.


But being faster than C is actually trivial and obvious, even in the general case...

Here is something I crapped out 6 years ago to prove the point: https://github.com/semiessessi/CP1

--verbose please, before we breathlessly run out to install Visual Studio so we can run code we know nothing about?

Pretty sure the words function can be simplified to {⍴(~⍵∊⍺)⊆⍵} (and it also handles empty strings).

How nice of you to copy/paste my reddit comment!

C is a low bar. It has always been slower than Fortran. It has been slower than C++ ever since reasonable optimization was implemented. Anything not faster than C, today, is a slow language, practically by definition.

> It has been slower than C++ ever since reasonable optimization was implemented.

Which compilers? Which platforms? Optimized for speed or space (because "faster" is only one dimension of optimization)?

Many C / C++ compiler implementations share the same front ends, back ends and runtime libraries, so it's hard to see how the code generation for C would be much different than that of C++. (In fact, C++ will be harder to optimize if features like exceptions and RTTI are used).

Given the many different commercially and freely available toolchains, this statement is difficult to back up.

The one area where C++ can be a lot faster than C is in places where C uses a function pointer where C++ uses a function template.

The standard example is sorting. Say you’re sorting an array of integers. Then, C’s sort has to (1) call the comparison function passed as an argument, whereas C++’s std::sort can inline the comparisons into a single instruction.

(1) if the source of the function passed in is visible from the compilation unit, I think the C standard allows the compiler to compile a specialized version of sort in the same way C++ can, but I’m not aware of any compiler that does that

Right, inlines are hard to beat, and they are not part of the C standard (which is kind of mind boggling, I mean . . . honestly, what year is it?).

Still, the benefit of inlines diminishes as the size of the inlined functions increases (you start paying penalties for extra cache lines full of code, and the percentage of time in function call overhead gets small in a hurry).

The linker can get into the inlining game, too, by the way, if the win is sufficiently big. I have scars to prove it. :-)

What do you mean by "inlines are not part of the C standard"?

The standard talks about visible behaviour of programs, never the concrete implementation.

You're correct about C99. I cut my teeth on K&R; C99 has inline declarations.

A C compiler can certainly also inline functions, as long as the definition is visible. Neither is there any guarantee that all C++ template instantiations will be inlined.

That’s true, and C compilers do so for short functions such as memset and strcpy (and don't even need the definition to be visible), but historically they haven’t for longer functions, and, I guess, many programmers wouldn’t like it if, say, the compiler generated ten specialized copies of sort, speeding up those calls but also blowing up executable size.

On the contrary, C++ programmers historically expect all templated code to be specialized. It’s only relatively recently that C++ linkers started to merge copies of template expansions.

s/The one area/One area/.

Another is where a better algorithm is available in a library than would be convenient to implement in custom form in C. C++ libraries get a lot more attention to optimization than C libraries because it pays back much more. For example, there are really excellent hash table libraries in C++ that would be unimplementable, as a reusable library, in C. C hash tables are typically hand-rolled in place.

For my part, I'm never interested in blanket comparisons of what language is faster, but it is worthwhile to mention specific features and how they impact performance.

For example, in Python you can talk about the GIL and pointer chasing. But you can also talk about Cython. It turns out that, for me, Python's FFI is a really important factor, and a big reason why it's much easier for me to get good performance in Python than it is in Java: It makes it easy to interact with C libraries without taking a big hit on marshaling data in and out of Java's managed heap. (This is, incidentally, a major factor in Python having so thoroughly supplanted Java in the ML/BI/DS/analytics space.) But there are other situations where Java would be the clear winner.

My favorite performance comparison between languages ever was the one that Raymond Chen and Rico Mariani did on implementing a practical program in C# and C++. [1] As it turns out that, at least as of 14 years ago, C++ is faster than C#, and C# is faster than C++, under different circumstances. You've really got to pay attention to which ones matter most to you before deciding where to place your bets, and just paying attention to rankings on The Computer Language Benchmarks Game is a sign that you've tried to jump to the answer without really taking enough time to understand the question.

[1]: https://blogs.msdn.microsoft.com/ricom/2005/05/10/performanc...

> … just paying attention to rankings on The Computer Language Benchmarks Game is a sign that you've tried to jump to the answer without really taking enough time to understand the question.

Instead, we could pay attention to program measurements and source code.

There is no such thing as a C / C++ compiler implementation. There are C implementations, and C++ implementations that may share some components. They do not typically share optimizers, because optimization of C++ requires quite different characteristics of an optimizer than one for C -- although most of the C optimizations may be applied after the C++ ones have run.

If it is hard to see how C++ optimization would differ from C optimization, you simply aren't looking very hard.

There are many different toolchains, but with the toolchains that are commonly used, C++ programs are faster than C programs. More specifically: C++ programs are at least as fast as C programs, because a C program is a badly-coded C++ program that fails to use language constructs that would make it a better program.

Citations needed. As others have pointed out this is not a comparison of optimized Dyalog APL with C. This is a comparison of Dyalog APL with a pre-built command-line tool with unknown optimizations.

Someone in the post on the Haskell version posted the source to the version of wc used [0], and it was a fairly naive implementation.

0: https://opensource.apple.com/source/text_cmds/text_cmds-68/w...

1st comment. Ever. Yes, here in the 23rd century APL has returned to its righteous glory and C is just a few bytes of late-20th-century IT (to use your language) history.
