I'm not surprised at all about this result, although I certainly wouldn't use it to make a pronouncement about Dyalog or C as a whole. But there are some places where interpreted array languages have a major advantage over typical compiled languages.
One of the advantages seen in this wc function is our use of bit booleans rather than byte booleans. Packing eight booleans into each byte uses an eighth of the space, and can lead to drastic speed improvements: 2-8 times faster than even code that uses short ints.
Another big improvement is coming next year in Dyalog 18.0. The computations data∊nl and data∊sp will use vectorised search methods similar to Intel's Hyperscan (https://github.com/intel/hyperscan). The character lookups here become lookups into a 256-bit table (two SSE registers), searched using branchless SSSE3 instructions. I didn't go through this specific algorithm, but I explained many of our new search techniques at last year's user meeting: https://www.youtube.com/watch?v=paxIkKBzqBU.
By my measurements, each improvement (changing the word boundary computation, and switching to the unreleased Dyalog 18.0) knocks off about a third of the original total time. With both, this code is three times faster than it was before!
On the subject of Dyalog APL performance: in one of your other talks, about a proposal for thunking / lazy execution, IIRC there was a slide of "all the high-level patterns the interpreter recognises and special-cases", and I was surprised how few there are - about a dozen or so.
Given how often people claim that "there's potential for an APL interpreter to recognise this slow prime number generator and special-case it", I assumed a lot of the work of speeding up an APL interpreter would be years of building up a vast array of special-case pattern handlers, but that doesn't seem to be the case. Is it much harder than it seems? Do the same code patterns not come up often enough to bother with?
That said, we still do a lot of work on recognising patterns. The idioms from my presentations are patterns recognised as a sequence of tokens during parsing, but we can also recognise particular derived functions (the results of operators, or function trains). An obvious example is the sum +/, and there are some pretty complicated ones involving Rank or Key. There are probably about a hundred cases like this in total, although many of them handle several different combinations. Much of what I do is not to identify more special cases to handle with a custom algorithm, but to make the algorithms more general and use them in more places. I tend to develop engines (say, a column permuter using vector shuffles) and then write a bunch of code to recognise when I can use those engines for particular cases within a primitive or combination (indexing, reverse, rotate, take/drop, and transpose on trailing axes).
The most important work is on short patterns because the longer ones just don't show up often enough, or they show up in too many different permutations. Thunks are a way to recognise short patterns flexibly: it doesn't depend on the way the pattern is written, just the functions used. They also offer flexibility in that different operations can sometimes emit the same thunk with a parameter, allowing other functions to just handle that type of thunk as a whole. We've run into some trouble with our internal architecture that's holding thunks up (should be cleared up by the 18.0 release so we can start implementing them for 19.0), but I think recognising special combinations will be very different, and much easier, once they're working. And I can finally get the aforementioned (⊃⍋) running fast. And 1↑⍋. And ⊣/⍋...
Turning +/iota into a constant-time formula feels like a great idea that APL is in a position to benefit from, and mathematicians could see endless places where that kind of rewrite might be possible - but if that's true, APL users can find those formulas themselves when they need them. What they can never write by hand is a vectorised pick of grade up.
This is so interesting, thank you.
Also, will anyone from Dyalog be at Code Mesh?
> One of the advantages seen in this wc function is our use of bit booleans rather than byte booleans. Packing eight booleans into each byte uses an eighth of the space, and can lead to drastic speed improvements: 2-8 times faster than even code that uses short ints.
I once rewrote a structured Verilog compiler (when I was the sole full-time employee at ChipScan.us); going from bytes for boolean values to packing the entire circuit's wire state into bits in a char* definitely helped a lot. Marking that char* as restrict helped a lot too :-)
Big wins are made of many small victories.
We're not sending anyone to Code Mesh. I hadn't heard of the conference, but given how close it is, maybe we should apply in the future.
You can email me with my first name at the company site to talk more. There's also a chat room at https://chat.stackexchange.com/rooms/52405/the-apl-orchard which is good for asking questions.
I hope to see some of you there sometime in the future.
Edit: Still true in Linux today.
$ LANG=C time -p wc /tmp/foo 2>&1 >/dev/null
$ LANG=C.UTF-8 time -p wc /tmp/foo 2>&1 >/dev/null
That is to say, the words we use matter: I'm excited that Haskell and OCaml (and so on) have efficient solutions, but I'm extremely disappointed that the best implementations look nothing like the "obvious" approach.
Maybe that's to be expected; after all, an "obvious" implementation in C that pumps getchar() all day will be pretty slow, and experienced C programmers will do their own buffering and threading to win -- and that doesn't look like the "obvious" one either.
And yet, in k/q the "obvious" implementation is the fast one. That's cool, and way more cool than you might realise as long as you think coding is hard.
Readability-wise this suffers (not everyone will recognise the CR LF space TAB sequence, and without that it would be really hard to understand, I think), but I don't find it much more unreadable than the faster Haskell versions (you need a lot of prior knowledge to fluently read those as well). At least the q version is built from primitives, while the Haskell one needs quite a lot of libraries to perform at all. I like Haskell (and more static typing in general), but I also like having one short manual and being able to implement everything I want without pulling in many (unknown, potentially painful) libraries - and without writing tens of thousands of lines of code either.
Could someone on OSX post a benchmark of their system wc vs. one compiled manually from source using gcc or clang with -O3?
Last time around I did this for Ubuntu and found a 2x difference: https://news.ycombinator.com/item?id=21271951
So if this article ends up beating their system wc by 2x, that might not say anything about "beating C".
... although the code quality of the examples makes it look a million times more like one than it needs to
Actually it does:
"An esoteric programming language (sometimes shortened to esolang) is a programming language designed to test the boundaries of computer programming language design, as a proof of concept, as software art, as a hacking interface to another language (particularly functional programming or procedural programming languages), or as a joke"
How something started and what it became are orthogonal...
I'm pretty sure Jack the Ripper started as a cute baby too...
The paper http://www-formal.stanford.edu/jmc/recursive.html was published when Lisp was already implemented, to present some of its ideas to a different audience; Lisp already existed by then as a running programming language.
First and foremost, Lisp was a real programming language implemented on a real computer.
See also: History of Lisp, John McCarthy, http://jmc.stanford.edu/articles/lisp/lisp.pdf
"My desire for an algebraic list processing language for artificial intelligence work on the IBM 704 computer arose in the summer of 1956..."
"The implementation of LISP began in Fall 1958 ... These included programs to read and print list structure. I can’t now remember whether the decision to use parenthesized list notation as the external form of LISP data was made then or whether it had already been used in discussing the paper differentiation program."
"The M-notation also used brackets instead of parentheses to enclose the arguments of functions in order to reserve parentheses for list-structure constants. It was intended to compile from some approximation to the M-notation, but the M-notation was never fully defined, because representing LISP functions by LISP lists became the dominant programming language when the interpreter later became available."
"Anyway, I decided to write a paper describing LISP both as a programming language and as a formalism for doing recursive function theory. The paper was Recursive functions of symbolic ex- pressions and their computation by machine, part I (McCarthy 1960)."
Early on, McCarthy proposed writing a compiler - in fact, this was one of the very early papers:
J. McCarthy. Memo to P. M. Morse: A Proposal for a compiler. Memo CC-56, Computation Center, Massachusetts Institute of Technology, December 13, 1957, 19 pages.
Here is something I crapped out 6 years ago to prove the point: https://github.com/semiessessi/CP1
Which compilers? Which platforms? Optimized for speed or space (because "faster" is only one dimension of optimization)?
Many C / C++ compiler implementations share the same front ends, back ends and runtime libraries, so it's hard to see how the code generation for C would be much different than that of C++. (In fact, C++ will be harder to optimize if features like exceptions and RTTI are used).
Given the many different commercially and freely available toolchains, this statement is difficult to back up.
The standard example is sorting. Say you're sorting an array of integers. C's qsort has to call the comparison function passed as an argument through a function pointer (1), whereas C++'s std::sort can inline the comparison down to a single instruction.
(1) if the source of the function passed in is visible from the compilation unit, I think the C standard allows the compiler to compile a specialized version of sort in the same way C++ can, but I’m not aware of any compiler that does that
Still, the benefit of inlines diminishes as the size of the inlined functions increases (you start paying penalties for extra cache lines full of code, and the percentage of time in function call overhead gets small in a hurry).
The linker can get into the inlining game, too, by the way, if the win is sufficiently big. I have scars to prove it. :-)
The standard talks about the observable behaviour of programs, never the concrete implementation.
On the contrary, C++ programmers historically expect all templated code to be specialized. It’s only relatively recently that C++ linkers started to merge copies of template expansions.
Another is where a better algorithm is available in a library than would be convenient to implement in custom form in C. C++ libraries get a lot more attention to optimization than C libraries because it pays back much more. For example, there are really excellent hash table libraries in C++ that would be unimplementable, as a reusable library, in C. C hash tables are typically hand-rolled in place.
For example, in Python you can talk about the GIL and pointer chasing. But you can also talk about Cython. It turns out that, for me, Python's FFI is a really important factor, and a big reason why it's much easier for me to get good performance in Python than it is in Java: It makes it easy to interact with C libraries without taking a big hit on marshaling data in and out of Java's managed heap. (This is, incidentally, a major factor in Python having so thoroughly supplanted Java in the ML/BI/DS/analytics space.) But there are other situations where Java would be the clear winner.
My favorite performance comparison between languages ever was the one that Raymond Chen and Rico Mariani did on implementing a practical program in C# and C++.  As it turns out that, at least as of 14 years ago, C++ is faster than C#, and C# is faster than C++, under different circumstances. You've really got to pay attention to which ones matter most to you before deciding where to place your bets, and just paying attention to rankings on The Computer Language Benchmarks Game is a sign that you've tried to jump to the answer without really taking enough time to understand the question.
Instead, we could pay attention to program measurements - and source code.
If it is hard to see how C++ optimization would differ from C optimization, you simply aren't looking very hard.
There are many different toolchains, but with the toolchains that are commonly used, C++ programs are faster than C programs. More specifically: C++ programs are at least as fast as C programs, because a C program is a badly-coded C++ program that fails to use language constructs that would make it a better program.