
Beating Decades of Optimized C with 80 Lines of Haskell - rachitnigam
https://chrispenner.ca/posts/wc
======
Sysreq1
Throwing multiple cores at the problem is not what I would call beating
decades of optimized C. The author only utilizes multiple cores once he
realizes the overhead of Haskell will not allow him to actually win. The
author even admits the C program was more efficient. You can multi-thread a C
program as well, at which point it would retake the title.

~~~
simias
There seems to be an unhealthy obsession with beating C in some corner of the
Haskell community. It makes even less sense nowadays, when some of the most
popular languages out there are Javascript and Python; clearly, C-grade
performance is no longer required by the vast majority of applications.

I actually wrote a comment (in 2013, but the page I'm referencing doesn't
appear to have changed one bit) where I shared my experience looking at
haskell.org's introduction that was filled with FUD regarding C and how
Haskell was so much better and faster:

[https://news.ycombinator.com/item?id=5090808](https://news.ycombinator.com/item?id=5090808)

Maybe if they had actually tried to teach me the language instead of that bad
faith "used car salesman" tactic I'd be writing Haskell these days.

~~~
tom_mellior
> There seems to be an unhealthy obsession with beating C in some corner of
> the Haskell community.

I think it's about the same for pretty much any language in the "statically
typed, compiled" camp. You'll see "faster than C" claims for Rust and Go as
well, for example.

~~~
simias
That's true. I guess what makes Haskell stand out to me is that at this point
it's fairly clear the assertion is mostly wrong (i.e. idiomatic Haskell is
_not_ faster than idiomatic C at most algorithmic tasks, and is often
significantly slower), and instead of doing the reasonable thing and saying
"additional safety and expressiveness are more important than raw performance
for most applications" (something I'd completely agree with personally), they
double down with these meaningless, unfair benchmarks.

Meanwhile Rust is genuinely competitive with C performance-wise and thanks to
generics and things like more aggressive inlining can actually equal or even
beat C without requiring esoteric micro-optimizations. Of course the drawback
is that writing Rust code is significantly more complicated and more
restrictive.

------
test9753
Beating Decades of Optimized C with _27_ Lines of Ocaml:

    
    
      type t = { mutable words: int; mutable chars: int; mutable lines: int; mutable in_word: bool ; mutable finished: bool};;
    
      let () =
        match Core.Sys.argv with
        | [| prog_name |] -> Core.Printf.eprintf "Usage: %s file1 file2 ...\n" prog_name
        | _ -> (
          let args = Core.Array.slice Core.Sys.argv 1 @@ Core.Array.length Core.Sys.argv in
          let buf_size = 65536 in (* 64 KB -> Caml IO buffer size *)
          let buf = Core.Bytes.create buf_size in
          Core.Array.fold args ~init:() ~f:(fun _ file ->
            Core.In_channel.with_file file ~f:(fun in_ch ->
              let c = { words = 0; chars = 0; lines = 0; in_word = false; finished = false } in
              let set_words () = if c.in_word then c.words <- c.words + 1 in
              while not c.finished do
                let len = Core.In_channel.input in_ch ~buf ~pos:0 ~len:buf_size |> Core.Int.to_int in
                if len > 0 then (
                  for i = 0 to (len - 1) do
                    match (Core.Caml.Bytes.get buf i) with
                    | ' ' | '\t' | '\r' -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false)
                    | '\n'              -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false; c.lines <- c.lines + 1)
                    | _                 -> (c.chars <- c.chars + 1; c.in_word <- true)
                  done
                ) else ( c.finished <- true )
              done;
              set_words ();
              Core.Printf.printf "%s -> lines: %d, words: %d, characters: %d\n" file c.lines c.words c.chars)))
      ;;
    

Testing:

    
    
      $ ls -lh test_file.txt 
      -rw-r--r--  1 user  group   508M Oct 16 12:01 test_file.txt
    
      $ time wc test_file.txt 
      4863460 54621760 532480000 test_file.txt
    
      real 0m3.368s
      user 0m3.137s
      sys  0m0.195s
    
      #Ocaml version:
      $ time ./wcl.native test_file.txt 
      test_file.txt -> lines: 4863460, words: 54621760, characters: 532480000
    
      real 0m2.480s
      user 0m2.345s
      sys  0m0.123s

~~~
geocar
Oh this is fun! Beating Decades of Optimised C (best of three)

    
    
        $ time wc w.txt
         3156098 6312196 380648004 w.txt
        
        real 0m1.358s
        user 0m1.277s
        sys 0m0.066s
    

... with one line of q (best of three):

    
    
        q)\t g:{(sum[1<deltas where x]+sum[not x:max x],0),sum max x 0 1}0x0d0a2009=\:read1 `:w.txt
        783
        q)g
        380648004i
        6312196
        3156098i
    

783msec: almost twice as fast!

~~~
yiyus
Won't this give an incorrect number of bytes for several consecutive
whitespace characters? Why not use # to count the bytes instead?

~~~
tluyben2
in k, but with that same bug:

(b:#a;b++/,/a=/:" ";b+#,/a)

~~~
tluyben2
Here is one without the bug:

b:#a;c:,/a=/:" ";(b;b++/c@&~c&=':c;b+#,/a)

Now to see if I can check the performance.

~~~
yiyus
We discussed this at length in the shakti mailing list [1]. This version by
chrispsn is my favorite:

    
    
        {+/["\n"=x],+/[<':~x in "\n\r\t "],#x}
    

I think it's the one that most clearly expresses the intent. However, this
other one by Attila Vrabecz performs better in the current version:

    
    
        {(+/x in"\n\n";+/0>':x in" \n\t\r";#x)}@1:"big.txt"
    

[1]
[https://groups.google.com/forum/?utm_medium=email&utm_source...](https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/shaktidb/iXpNyAai_x4/3iWWt4rOAQAJ)

~~~
tluyben2
Nice, thanks for that!

------
Noe2097
What clickbait indeed. The final version still uses 3 times more memory, but
the worst part is that it just doesn't work in the general case.

The way I generally use "wc" is inside a somewhat complex command line, with
commands preceding it, "feeding characters" to it.

There is a reason why "wc" is not multithreaded: it just can't. It must work
sequentially, because in the general case the input of "wc" cannot be skipped
over.

This is one of the two big assumptions that are made by the author ("wc works
only for files, so we can lseek") -- the second, identified by the author,
being that the underlying hardware and filesystem must support concurrent
access to the same location efficiently.

~~~
nightcracker
> There is a reason why "wc" is not multithreaded: it just can't. It must work
> sequentially, because in the general case the input of "wc" cannot be
> skipped over.

Eh what? Word counting is embarrassingly parallel, at least for files.

You start up k threads each working on its own chunk of the file, assuming
that there is a word boundary at the start of its chunk. Then you sum up the
total word count but inspect whether the chunk boundaries actually were word
boundaries, and subtract 1 each time this was violated.
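A minimal sketch of that chunk-and-correct scheme in C (single-threaded here for clarity; each per-chunk count is independent, so each loop iteration could run on its own thread; the function names are mine, not from the article):

```c
#include <ctype.h>
#include <stddef.h>

/* Count words in buf[start..end), assuming a word boundary precedes start. */
int count_words(const char *buf, size_t start, size_t end) {
    int words = 0, in_word = 0;
    for (size_t i = start; i < end; i++) {
        if (isspace((unsigned char)buf[i])) in_word = 0;
        else if (!in_word) { in_word = 1; words++; }
    }
    return words;
}

/* Split into k chunks, count each independently, then subtract one word for
   every chunk boundary that actually fell inside a word (that word was
   counted once by each neighboring chunk). */
int parallel_word_count(const char *buf, size_t len, int k) {
    int total = 0;
    for (int c = 0; c < k; c++) {          /* each iteration could be a thread */
        size_t start = len * c / k, end = len * (c + 1) / k;
        total += count_words(buf, start, end);
    }
    for (int c = 1; c < k; c++) {          /* fix up the boundary guesses */
        size_t b = len * c / k;
        if (b > 0 && b < len &&
            !isspace((unsigned char)buf[b - 1]) &&
            !isspace((unsigned char)buf[b]))
            total--;                        /* one word was counted twice */
    }
    return total;
}
```

The answer comes out the same no matter where the chunk boundaries land, which is what makes the problem embarrassingly parallel for seekable input.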

~~~
foxhill
> Eh what? Word counting is embarrassingly parallel, at least for files.

wc does not make the presumption that it is working on a file that can be
seeked into (and, more importantly, re-wound). in c++ terms, it behaves as
though the source it is reading from is an "input iterator"[1].

if you are piping into wc, then you'll never be able to parallelise it. word
counting a file is the edge case here, not the general use.

[1] -
[https://en.cppreference.com/w/cpp/named_req/InputIterator](https://en.cppreference.com/w/cpp/named_req/InputIterator)

~~~
vidarh
You don't need to seek: Spin up workers, have them wait on a mutex, and read
blocks. If the producer is slow, then you won't need to feed multiple workers.
If it's fast, it's a matter of finding the tradeoff between block size vs. the
processing time per block.

~~~
foxhill
i expect you're being downvoted because it's likely that the cost of deciding
which thread to assign work to is an order of magnitude more expensive than
accumulating the required counters.

in any case: you're right, it is of course possible. what i should have said
was:

> if you are piping into wc, then you'll never see a performance gain from any
> attempts at parallelisation.

~~~
vidarh
Well, I'm back at 1 point now. In any case, the amount of work needed to
assign to a thread is trivial. For a small number of threads, it at most
requires a couple of mutex operations and a couple of load/stores to
distribute each block. Given the block size can be set arbitrarily high, the
overhead of work assignment per byte can be made arbitrarily low. The biggest
issue is that you won't know the input size, and for small inputs there will
be a cutoff point below which multiple threads add cost. But for small
inputs, performance won't be an issue anyway.

EDIT: Actually, the simplest way of assigning work is probably just to have
each thread try to acquire a mutex, call read(), and then release the mutex
and do their work.
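A rough sketch of that mutex-guarded read loop in C with pthreads, counting newlines since per-block line counts need no cross-block state (the names, block size, and worker count here are illustrative assumptions, not from the thread):

```c
#include <pthread.h>
#include <stdio.h>

#define BLOCK 65536
#define NWORKERS 4

static FILE *input;                     /* shared stream, e.g. a pipe */
static pthread_mutex_t read_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static long line_count;                 /* guarded by count_lock */

/* Each worker: take the lock, grab the next block, release, then count.
   Per-block newline counts are independent, so no ordering is needed. */
static void *worker(void *arg) {
    char buf[BLOCK];
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&read_lock);
        size_t n = fread(buf, 1, BLOCK, input);
        pthread_mutex_unlock(&read_lock);
        if (n == 0) break;              /* EOF or error: worker retires */
        long local = 0;
        for (size_t i = 0; i < n; i++)
            if (buf[i] == '\n') local++;
        pthread_mutex_lock(&count_lock);
        line_count += local;
        pthread_mutex_unlock(&count_lock);
    }
    return NULL;
}

long count_lines_stream(FILE *f) {
    pthread_t tids[NWORKERS];
    input = f;
    line_count = 0;
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
    return line_count;
}
```

As noted above, whether this ever beats a single thread depends on the producer's speed and the block size; the scheme only distributes work, it cannot make a slow pipe faster.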

------
cosarara
Last week he had "Learning Haskell is no harder than learning any other
programming language"[1]. This week we have:

> The program takes more than a few minutes and quickly spikes up to more than
> 3 GB of memory! What's gone wrong? Well, we used the strict version of foldl
> (indicated by the trailing tick '); BUT it's only strict up to "Weak Head
> Normal Form" (WHNF), which means it'll be strict in the structure of the
> tuple accumulator, but not the actual values!

As easy as any other programming language!

[1]
[https://news.ycombinator.com/item?id=21170547](https://news.ycombinator.com/item?id=21170547)

~~~
chii
Every language has their esoteric quirks.

Tell me that it isn't weird to write this
[https://en.wikipedia.org/wiki/Duff%27s_device](https://en.wikipedia.org/wiki/Duff%27s_device)

Tell me it isn't strange that array access is the same as pointer dereference
(i.e. a[index] is the same as index[a])

~~~
cosarara
Duff's device is a low-level optimization that can even hurt modern
processors due to branch prediction. It's a niche trick from the olden days.

It is very strange at first sight that a[index] is the same as index[a], but
you will never see the second form used in practice. When you are learning,
you can write your code using normal array access syntax and solve problems.
Then you can learn to use pointers and dereference them. And one day you
might realize that x[y] is *(x+y), and thus y[x], and that it looks funny.
But that will be a long time after writing `wc`.

~~~
sifar
How would this work for unaligned accesses, which index would very likely be?
A neat way to look at things nevertheless, thanks.

~~~
cosarara
The alignment doesn't matter. `index` is not an offset in bytes, it's just an
index, a number. The position in memory of the variable also doesn't matter,
as it is not used; only its value is.

    
    
       index[arr] means *(index + arr)
    

Which is _exactly_ the same thing as arr[index], no matter how you look at it.
The type of arr is taken into account for pointer arithmetic.

Whatever works with `arr[index]` will also work with `index[arr]`. You can
also use a constant, for example `2[arr]`.

Damn I hate HN formatting.
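The claim can be checked with a few lines of C (the helper name is made up for the example):

```c
#include <assert.h>

/* arr[index] desugars to *(arr + index); addition commutes, so index[arr]
   is the very same expression and perfectly legal C. */
int subscript_demo(void) {
    int arr[] = {10, 20, 30, 40};
    int index = 2;
    assert(arr[index] == *(arr + index));
    assert(arr[index] == index[arr]);
    assert(2[arr] == 30);       /* a constant works on the left, too */
    return index[arr];
}
```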

~~~
sifar
Yes, I realized this after asking the question :). I must admit I never
thought of it this way.

Thinking more about this, it might fail if the compiler uses post-
incrementing loads/stores, which update the pointer addresses too. Or will it
not use them if I am using index[arr]?

Let's say the load/store updates the pointer too. So after the first
execution, ptr will be index+arr in both cases, but during the next
iteration, the former will be ptr[index] while the latter will be ptr[arr].

------
Cieplak
_Writing Haskell is almost trivial in practice. You just start with the magic
fifty line {-# LANGUAGE ... #-} incantation to fast-forward to 2017, then add
150 libraries that you’ve blessed by trial and error to your cabal file, and
then just write down a single type signature whose inhabitant is Refl to your
program. In fact if your program is longer than your import list you’re
clearly doing Haskell all wrong._

Stephen Diehl — _Reflecting on Haskell in 2017_

------
Matthias247
To be fair to the C version: the linked code [1] doesn't really look like
"decades of optimized C" to me. It's more like a straightforward
implementation that likely hasn't seen many updates in decades.

It could probably gain performance from playing around with input buffer
sizes and trying to make use of vectorized instructions (the compiler might
try to auto-vectorize the loops, but there is no guarantee).

[1]
[https://opensource.apple.com/source/text_cmds/text_cmds-68/w...](https://opensource.apple.com/source/text_cmds/text_cmds-68/wc/wc.c.auto.html)

~~~
abainbridge
Exactly this. I see this as "Simplest possible C code beats carefully hand
optimized Haskell".

------
peteretep
> We're still orders of magnitude away from wc unfortunately

Just a reminder that “orders of magnitude” means something specific, rather
than “quite a bit”, and 3 is not “orders of magnitude” more than 0.3, and
neither is 5.48 “orders of magnitude” more than 1.86

~~~
sclangdon
Actually, 3 is two orders of magnitude (or just 1 depending on whether you
count 1 as an order of magnitude or not) more than 0.3

0.3 = 3 * 10^-1, which is -1 order of magnitude

3 = 0.3 * 10^1, which is 1 order of magnitude

You are right that 5.48 is not an order of magnitude more than 1.86, however.
They are the same order of magnitude.

~~~
murderfs
> (or just 1 depending on whether you count 1 as an order of magnitude or not)

Of course you don't count 1: otherwise 3 would be 1 order of magnitude greater
than 3, which is obviously wrong.

------
tom_mellior
I grabbed the source linked from the article, made it run on Linux, compiled
it with gcc -O3, and it beats my system's wc by a factor of almost 2x. C is
faster than C!

System wc, fastest of 5 runs:

    
    
          59180000  213180000 1407437834 wc_input
        11.53user 0.19system 0:11.74elapsed 99%CPU (0avgtext+0avgdata 2428maxresident)k
        0inputs+0outputs (0major+102minor)pagefaults 0swaps
    

Manually compiled wc, fastest of 5 runs:

    
    
         59180000 213180000 1407437834 wc_input
        6.08user 0.20system 0:06.29elapsed 99%CPU (0avgtext+0avgdata 2180maxresident)k
        0inputs+0outputs (0major+92minor)pagefaults 0swaps
    

This is on Ubuntu 18.04 on x86_64. If the author's wc also came precompiled
(which it seems to be), they cannot make any valid comparison at all. If
manually compiling on their machine with maximum optimization would also get
them a 2x speedup, their fancy multicore Haskell would still be slower than
C... _and that 's fine_.

------
z92
Title should be "Beating Decades of Optimized _Single_ Threaded C with 80
Lines of _Multi_ Threaded Haskell"

That's not what I expected to read.

------
dig1
I find these "beating C with X" attempts and articles funny; they've been
popular since the dawn of computing.

Although all of them try hard to be a little bit faster (and shorter) than C
code, this is usually the price they have to pay:

1\. Abysmal compilation time.

2\. Huge binaries. Sure, C static binaries aren't small, but they're much
smaller than equivalent Go programs.

3\. Complex language. C is still a simple language (sans edge cases). No need
to dive into type theory, learn about functors and so on just to write
something.

4\. Complex implementations (compilers, garbage collectors...). Not something
you can hack over the weekend.

5\. Complex distribution. C natural environment is Unix/Linux and libc/glibc
is always there. Everything else requires additional (usually tons of)
libraries or static linking.

6\. Language instability. Because of 3), these languages are always evolving.
You can read and compile C code from 20-30 years ago, and I expect the same
for at least the next 10 years.

------
_bxg1
Very interesting, and a simple enough use case that a Haskell novice can get
the gist of it. It's very informative about how real Haskell development
works; it's a brass-tacks kind of example, which you don't necessarily see a
lot of in the Haskell world.

------
zelly
Every "X new thing I made on a weekend is faster than Y old thing" is
inevitably a half-implemented version of the old thing. Of course executing
fewer instructions is going to be faster. You could also make a
benchmark-winning web server that just write()s to a socket, skipping all
sanitization and error handling.

------
djmips
Is the code they are beating actually 'hand-optimized C'? Because at a glance
at the C code, I didn't get that impression.

~~~
yxhuvud
Yes, the C is much too readable. Also, I wonder how fresh it is - Macs have
been known to ship some really old versions of GPL tools due to license
issues with newer versions.

~~~
saagarjha
2008, and I'll have you know that it's

    
    
      Copyright (c) 1980, 1987, 1991, 1993
          The Regents of the University of California.  All rights reserved.

------
Smithalicious
> Basically we need to count the number of times a given invariant has changed
> from the start to the end of a sequence.

I can say with confidence that the invariant will change exactly 0 times. No
clue what the author is trying to say here.

------
zmix
Isn't Pandoc[1] written in Haskell? The only real slowdown I experience with
it is startup time.

    
    
        > If you need to convert files from one markup format into another, pandoc is your swiss-army knife.
    

So, Pandoc does a _lot_ of text processing.

[1] [https://pandoc.org/](https://pandoc.org/)

------
UweSchmidt
In general terms, how is it possible to beat really optimized C code? In
particular, what property of Haskell or any other functional language would
make that possible (in theory or in practice)?

~~~
ummonk
The C code is not particularly optimized, but more importantly, C being a
low-level language makes writing multi-threaded programs much more work. wc,
though, cannot make use of multi-threading regardless, because it must be
able to work on streams, not just the seekable files the Haskell code here
assumes.

~~~
Annatar
C is a high-level language with the capability of low-level access: we
regularly made fun of C programmers in high school (like my math teacher)
because they didn't know how to code in assembler and had to write abstract
high-level code in C for the compiler to do it for them. The biggest brouhaha
was always disassembling the "optimized" machine code generated by the C
compiler, full of needless stack twiddling and housekeeping code, wasting
hundreds or even thousands of CPU cycles on what amounts to useless
abstractions. That's nonsense one finds only in a high-level language. But
it's always super for a good brouhaha!

~~~
ummonk
We've long since moved on from an era when anything above assembly could be
considered high level.

And the performance difference between idiomatic C code compiled with -O3 (to
inline functions and minimize the stack twiddling and housekeeping code) and
handwritten, straightforward assembly is small enough to be negligible. Not
so with, say, Haskell vs C.

~~~
Annatar
Neither is true. Well, the first might be true for you in your mind, but that
doesn't change the fact that C is a high-level language modelling an abstract
system and providing that model through its standard libraries (which is what
makes it possible to write portable code).

And the second, with -O3, could not be more wrong, and I can prove it:
compile anything with gcc -S -O3 and behold the extra code a human would
never write because it's unnecessary.

~~~
tom_mellior
> I can prove it: compile anything with gcc -S -O3 and behold the extra code
> which a human would never write because it's unnecessary.

Since you claim that that's true for "anything", care to show us an example
consisting of a 5-10 line C function and your assembly version?

~~~
Annatar
Considering I wrote above exactly what to do (gcc -S), I believe you are not
making this request in good faith; rather, you seek to test whether I know
what I'm writing about. However, had you done what I told you to do, you
wouldn't have asked. That's why I think you are actually trolling me.

~~~
tom_mellior
Input code:

    
    
        float dot_product(float *a, float *b, int n) {
            float result = 0.0f;
            for (int i = 0; i < n; i++) {
                result += a[i] * b[i];
            }
            return result;
        }
    

Output from gcc -O3 -S:

    
    
            .file   "foo.c"
            .text
            .p2align 4,,15
            .globl  dot_product
            .type   dot_product, @function
        dot_product:
        .LFB0:
            .cfi_startproc
            testl   %edx, %edx
            jle .L4
            leal    -1(%rdx), %eax
            pxor    %xmm0, %xmm0
            leaq    4(,%rax,4), %rdx
            xorl    %eax, %eax
            .p2align 4,,10
            .p2align 3
        .L3:
            movss   (%rdi,%rax), %xmm1
            mulss   (%rsi,%rax), %xmm1
            addq    $4, %rax
            cmpq    %rax, %rdx
            addss   %xmm1, %xmm0
            jne .L3
            rep ret
            .p2align 4,,10
            .p2align 3
        .L4:
            pxor    %xmm0, %xmm0
            ret
            .cfi_endproc
        .LFE0:
            .size   dot_product, .-dot_product
            .ident  "GCC: 7.4.0"
            .section    .note.GNU-stack,"",@progbits
    

Where's the unnecessary code? What does your hand-optimized version look like?
Is your hand-optimized version faster than this?

~~~
Annatar
This is how _a human_ would code the same (a human would know, for instance,
that (s)he wants to multiply 50 times, passing that in %rax (that would be
your "n" in C), something a compiler can never know, which is _one of many
reasons_ (and there are _many_!) why a compiler can _never, ever_ beat a human
at coding in assembler:

    
    
      DotProduct: subss %xmm0, %xmm0
      .Loop:      movss (%rdi, %rax, 4), %xmm1
                  mulss (%rsi, %rax, 4), %xmm1
                  addss %xmm1, %xmm0
                  subq $1, %rax
                  cmpq $0, %rax
                  jnz .Loop
                  ret
    

...potentially, depending on how 80x86 processors work, the Z flag might be
set when subq $1, %rax reaches 0, and the cmpq $0, %rax could be optimized
away, making the code even shorter (and faster). But I'll leave it in there
because this handily beats the compiler in both size and speed, even with the
compiler often trying to cheat with nopw instructions to warm up the data
cache. Now compare that with the garbage GCC generated. At -O3 no less! What a
bunch of garbage!

On a personal note, you forced me to write code for a processor architecture I
do not even know (80x86 family of processors in 64-bit mode), _at all_ , so
you just assumed my primary system is powered by that intel / AMD garbage of
processors, and I don't appreciate that one bit. And you doubted me, which I
appreciate even less.

~~~
tom_mellior
> a human would know, for instance, that (s)he wants to multiply 50 times

A human would turn the for loop that counts up into a do-while loop that
counts down? That's doable on the C level, and in that case GCC also generates
less ceremony:

    
    
        dot_product_2:
        .LFB1:
            .cfi_startproc
            pxor    %xmm0, %xmm0
            movslq  %edx, %rdx
            .p2align 4,,10
            .p2align 3
        .L8:
            movss   (%rdi,%rdx,4), %xmm1
            mulss   (%rsi,%rdx,4), %xmm1
            subq    $1, %rdx
            leal    1(%rdx), %eax
            testl   %eax, %eax
            addss   %xmm1, %xmm0
            jg  .L8
            rep ret
    

But of course if you do this you have changed the implied original spec, and
you have reassociated a floating-point computation, computing different
results. At that point you might as well pass -ffast-math to GCC, and it will
beat your code by 8x due to vectorization.

> But I'll leave it in there because this handily beats the compiler in both
> size and speed

Not on my machine. On arrays of 1048576 (2 ^ 20) elements I didn't measure any
difference between my first variant, the do-while variant and yours. On arrays
of 1024 elements I still didn't measure a difference between the two compiled
versions, but yours is about 1-2% slower. On 128 elements do-while looks a bit
slower than for, and your variant is more than 20% slower.

Here are some numbers for n = 128, calling each function 10000000 times.
dot_product is the original, dot_product_2 uses do-while, dot_product_3 is
your code (with an extra move of edx to rax at the start of the function, of
course).

    
    
        dot_product:   1.036514 sec
        dot_product_2: 1.097186 sec
        dot_product_3: 1.290411 sec
    

(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0
at the beginning. Removing that gets your code back up to only 1-2% slower
than the for loop version.)

> even with the compiler often trying to cheat with nopw instructions to warm
> up the data cache

What?

> Now compare that with the garbage GCC generated. At -O3 no less! What a
> bunch of garbage!

You still haven't pointed out what exactly you think is garbage. Which line?
Which instruction? Or don't you like that it generates some metadata?

> you forced me to write code for a processor architecture I do not even know

I didn't force you to do anything, I specifically asked you to provide an
example of your choice. You could have chosen a piece of code where you knew
in advance GCC would do badly. You could have chosen an architecture that you
know well and/or one where you know that GCC's backend hasn't had as much work
as x86-64. You chose not to do this. I didn't force you to do anything.

> And you doubted me

Based on the data it looks like I was right to doubt you.

~~~
Annatar
"A human would turn the for loop that counts up into a do-while loop that
counts down?"

Of course an experienced coder would do that; that's the entire point of why
a compiler will never be able to generate code as short and as fast as a
human can. An experienced coder can see opportunities that a program with
hard-coded logic such as a compiler is incapable of seeing; if it could, we
would have true artificial intelligence, with a neural network as dense as a
human brain's or denser.

"That's doable on the C level,"

For some bizarre reason unbeknown to me, you just cannot seem to get your
head around writing a program in assembler from start to finish. An
experienced assembler coder looking for performance would not use a
high-level language like C because it would be a waste of her or his time,
since they already know from experience that they can easily beat the
hard-coded program logic of a compiler (and if they did use one, they would
choose Fortran over C). Programming in a high-level language only makes sense
when the trade-off of losing performance is acceptable in the name of
portability. Even then, there are situations where a portable program in a
high-level language has hand-written, per-processor assembler code (OpenSSL
comes to mind). Why do you think that is?

"What?"

Some versions of GCC for the intel 80x86 family of processors will generate
one or more nopw instructions, the idea being to give the processor time to
prime the data cache. It's a cheap gimmick by desperate compiler
constructors, more of a gamble really, because cache games depend heavily on
the particular processor model within a processor family.

"At that point you might as well pass -ffast-math to GCC, and it will beat
your code by 8x due to vectorization."

I could have vectorized this myself (that was actually my first thought) but
chose not to because I didn't want to spend the time researching how to do
that on the intel 80x86 family of processors; I could have also unrolled the
loop four to eight times for even more of a performance gain, but chose not
to because I believe I made my point. Now you are just being purposely
obtuse, because it seems you just don't want to accept that high-level
compilers still aren't as fast as a human coder: you chose a dot product
because you likely knew that the compiler would generate pretty fast code,
but that one isolated example doesn't really prove anything. If you do more
comparisons over _decades_, like I have, you will see how silly it is to
argue that compilers generate faster code than coders do.

Also, -ffast-math can generate code which will run fastest only on that
particular processor (and might generate instructions which will not run on
other processors of the same family), but the far worse danger is that
-ffast-math won't generate IEEE-754 compliant results. When one needs very
high precision floating point arithmetic, -ffast-math cannot guarantee
correct results. For these two reasons, -ffast-math is useless in real-life
applications. And indeed, wouldn't you know it:

    
    
         -ffast-math
             Sets        the         options         -fno-math-errno,
             -funsafe-math-optimizations,         -ffinite-math-only,
             -fno-rounding-math,       -fno-signaling-nans        and
             -fcx-limited-range.
    
             This    option    causes    the    preprocessor    macro
             "__FAST_MATH__" to be defined.
    
             This option is not turned on by any  -O  option  besides
             -Ofast  since it can result in incorrect output for pro-
             grams that depend on an exact implementation of IEEE  or
             ISO  rules/specifications  for  math  functions. It may,
             however, yield faster code  for  programs  that  do  not
             require the guarantees of these specifications.
    

"(Edit: This seems to be due to your use of subss instead of pxor to zero xmm0
at the beginning. Removing that gets your code back up to only 1-2% slower
than the for loop version.)"

There are two ways to do it fast: eor (or what the idiotic intel architecture
calls "xor"), or the sub instruction. On the Motorola MC680x0 family, both
run at the same speed. Up until then, I had _never_ in my entire life written
a single line of assembler code for the intel 80x86 family of processors.
What you saw is the first assembler code I ever wrote for that platform; I
even had to look up which registers to use for floating point and how to do
floating point arithmetic on that family of processors, and my hard target
was less than 15 minutes of total time wasted on a _someone is wrong on the
InterNet_ ([https://www.xkcd.com/386/](https://www.xkcd.com/386/))
discussion. If I could do that in under 15 minutes for a processor family _I
literally know nothing about_, imagine what a real coder experienced in
writing 64-bit assembler code for the intel 80x86 family of processors can
do.

"You still haven't pointed out what exactly you think is garbage. Which line?
Which instruction? Or don't you like that it generates some metadata?"

I haven't because I was on a mobile telephone and it was just too much of a
bother to quote and reply to everything I wanted to address. I had to log
into a workstation to reply to this, and I resent that too. I don't like
_all of the above_, both the code and the metadata the compiler generated,
because it's machine-generated nonsense which jumps back and forth and is
hard to read and comprehend. Assembler code written by a human isn't. Now,
since I already resent what I'm doing, I might as well go all the way. Let's
pick apart the compiler-generated examples in reverse order.

    
    
            movslq  %edx, %rdx
    

movslq isn't even necessary, but the compiler, having hard-coded logic and
being just a program, generated it anyway. A waste of CPU cycles.

    
    
            .p2align 4,,10
            .p2align 3
    

more assembler directives because a compiler cannot figure out whether such
packing is necessary or not. My code didn't need them because I know that I
don't.

    
    
            leal    1(%rdx), %eax
            testl   %eax, %eax
    

again, nonsense: two instructions wasted with what could amount to a single
subq. This isn't more efficient.

    
    
            rep ret
    

Why, when a ret will do? Obviously in my code it worked, but the compiler felt
that it had to generate this nonsense, which on top of being nonsense is
harder to read and understand, and isn't any faster.

Next example:

    
    
            testl   %edx, %edx
            jle .L4
            leal    -1(%rdx), %eax
    

testl is a total waste of processor cycles here. Which purpose does it serve?
Then more cycles are wasted on jumping to an instruction to clear %xmm0,
presumably to be able to utilize the instruction pipeline, but all the jump
does is waste cycles.

    
    
            xorl    %eax, %eax
    

why xorl %eax in a loop when it could be done once? More waste of processor
cycles. That code is extremely inefficient, and hopes to trade off readability
for banking on a specific feature of intel 80x86, hoping that the instruction
cache and pipelining will offset the losses. Idiotic in the extreme
considering a human could get the same performance with far less and simpler
code!

"Based on the data it looks like I was right to doubt you."

It might look that way to you now, but if you choose to acquire more
experience in coding in assembler, you will eventually realize that I'm
correct. You just need more experience, and from that will come insight. Run
more tests on more algorithms; study the intel / AMD architecture manuals. Try
to construct your own compiler, which I think will really make you realize why
a program cannot beat a human. Shame only that you have to deal with such a
shitty family of processors, but even that insight will be extremely valuable.

~~~
tom_mellior
> Even then, there are situations where a portable program in a high level
> language has hand-written, per-processor assembler code (OpenSSL comes to
> mind). Why do you think that is?

It makes sense at times, in specific contexts. Which is a far cry from your
"compile _anything_ with gcc -S -O3 and behold the extra code which a human
would never write" (emphasis mine).

> nopw instructions, the idea being to give the processor time to prime the
> data cache.

I'm still only 95% sure what you mean by "priming the data cache", but if you
mean prefetching, (a) nop doesn't prefetch; and (b) there are prefetch
instructions for prefetching. And in any case prefetching wouldn't be
cheating.

> If you do more comparisons over decades like I have

Compilers have gotten much better over the last few decades. Maybe the things
you "know" are "always" true were true at some point in time and aren't true
anymore.

> the far worse danger is that -ffast-math won't generate IEEE-754 compliant
> results

Yes, that's what I said. You took my original function and wrote a version
that didn't generate IEEE-754 compliant results. If you're OK with changing
the semantics, you should use -ffast-math and let the compiler vectorize.

> There are two ways to do it fast: eor (or what the idiotic intel
> architecture calls "xor"), or the sub instruction

For whatever it's worth, I think sub would be incorrect if the original bit
pattern in the register were a NaN.

> two instructions wasted with what could amount to a single subq. This isn't
> more efficient.

And yet, somehow, this version of the code runs faster than your version.

> testl is a total waste of processor cycles here. Which purpose does it
> serve? Then more cycles are wasted on jumping to an instruction to clear
> %xmm0

That's the code for testing whether the for loop should ever be entered. If n
is less than or equal to zero (that's what's being tested by testl/jle), the
function returns zero. Which is why xmm0 (the return register) needs to be
zeroed in that case.

> why xorl %eax in a loop when it could be done once?

It's not in a loop. It's the "i = 0" setup code before the loop.

You have made it very clear that you don't feel qualified to write highly
optimized x86-64 code. Neither are you qualified to judge the quality of
x86-64 code if you can't tell what is inside a loop and what isn't.

Two last points before I drop this thread:

> you will see how silly it was trying to argue that compilers generate faster
> code than coders

I didn't argue that. I argued that your assertion "compile _anything_ with gcc
-S -O3 and behold the extra code which a human would never write" (emphasis
mine, again) was incorrect. That doesn't mean that I think that gcc will
always, or even sometimes, beat human coders. But it can match them very very
often.

> you chose a dot product because you likely knew that the compiler would
> generate pretty fast code [...] More waste of processor cycles. [...]
> Idiotic in the extreme

You're contradicting yourself. And you are calling idiots the many GCC
developers and Intel/AMD microarchitecture experts who _very likely_ have
pored over every single instruction of _this very code_ and decided that this
is the way it should be written for maximum performance.

I hope you have a wonderful day.

~~~
Annatar
"(a) nop doesn't prefetch;"

I wrote nopsw and you are writing about nop. Are you doing this on purpose?
nop doesn't, but _only on intel family of processors_ , nopsw has a side-
effect of prefetching.

"Yes, that's what I said. You took my original function and wrote a version
that didn't generate IEEE-754 compliant results."

Turns out, so did the GCC compiler, at least the one I have, so I'd say your
point is moot.

Truth of the matter is, you picked a really bad example: to solve it
correctly, one would have to implement at least a portion of the algorithms in
the GNU multiple precision library ("GMP"). I suspect you picking a floating
point example was not by accident.

"I think sub would be incorrect if the original bit pattern in the register
were a NaN."

Even NaN has to be represented by a bit pattern, so subtracting that bit
pattern from the register will yield zero.

"That's the code for testing whether the for loop should ever be entered."

And here we come back to my point: if you were coding this from scratch in
assembler, you _wouldn't_ write a generic function, and you'd know that n
will never be zero. And the reason why you'd never write a generic function is
because they lose you speed and increase code size. But a compiler cannot know
that and cannot optimize for such a situation. It's just a dumb program.

"You have made it very clear that you don't feel qualified to write highly
optimized x86-64 code. Neither are you qualified to judge the quality of
x86-64 code if you can't tell what is inside a loop and what isn't."

I spent 30 seconds looking at assembler code for a processor family I have
never coded on. I spent less than 15 minutes writing a piece of optimized
assembler code for that family and using GNU as, an assembler I never wrote
code in. Now you judge me on misinterpreting one clumsily generated compiler
instruction. By which logic, considering I was able to do all of this in
under 15 minutes, am I not qualified? I'm very pleased with myself: for the
time budget, an unknown processor, and an unknown assembler, I think I did
very well. We will have to disagree, vehemently if you please.

I stand by my assertion that a compiler will never be able to beat a human at
generating fast, optimized code, nor will it ever be capable of generating
smaller code. In addition I don't hold the GCC developers in high regard,
considering how notoriously bad their compilers are when compared to say,
intel or Sun Studio ones. Even the Microsoft compilers beat GCC in generating
code which runs faster. In fact, pretty much every compiler beats GCC in
performance, which means that people working on those GCC compilers aren't
good enough. GCC's only undisputed strength is in the vast support of
different processors. There, they are #1, but everywhere else they're last.
The GCC developers just don't have what it takes to be the best in that
business.

"And yet, somehow, this version of the code runs faster than your version."

I don't know that; you ran code which I wrote _blindly_ ; I was not even able
to reproduce your output with my GCC. That it runs faster is just your
assertion. Based on my experience, I have no reason to believe that.

"I hope you have a wonderful day."

As a matter of fact, I am about to go create a SVR4 OS package of GCC 9.2.0
which I patched and managed to bootstrap after a week's worth of work on Solaris
10 on sparc, so yes I will have a wonderful day enjoying the fruits of my
labors. I wish you a wonderful day as well.

------
atemerev
So, if you can write research-level stuff, invent some new category-
theoretical structures that warrant a scientific paper, know the exact
language extension to enable from the multitude of exotic optimizations
available in GHC, then yes, that's perfectly simple and accessible to
everyone!

~~~
chii
It's an unfair comparison: monads are a relatively well-known concept, and
compiler options and libraries are just as baffling in C as they are in GHC.

In all respects, the only "esoteric" thing is the WHNF (which, if you've
worked with Haskell for any length of time, is quite understandable).

~~~
atemerev
Flux monoids? You have to invent a new subclass of traversing monoids to
achieve performance somewhat comparable to imperative code?

~~~
chii
It is merely a monoid with a special property (which is a natural property
of counting words concurrently: "count the number of times a given invariant
has changed from the start to the end of a sequence").

He wrote the monoid directly rather than importing it from a library, and
it's a rather unimportant small detail.

He didn't invent a new research topic that nobody has seen before. You
would've implicitly done this same thing if you wrote a map/reduce algorithm
to count words across different machines.

------
loosetypes
Reminded me of the following paper, Using Coq to Write Fast and Correct
Haskell.

[https://www.cs.purdue.edu/homes/bendy/Fiat/FiatByteString.ht...](https://www.cs.purdue.edu/homes/bendy/Fiat/FiatByteString.html)

------
maccard
As a few have posted, this isn't decades of optimised C vs 80 lines of
Haskell; it's basic C vs basic Haskell, and C is orders of magnitude faster.

The entire source for wc is here [0]: 900 lines of very readable C that does
more than count words; it includes options, arguments, usage, and licensing.
When I strip all that back, it's only about 300 lines of very readable C.

[0]
[https://github.com/coreutils/coreutils/blob/master/src/wc.c](https://github.com/coreutils/coreutils/blob/master/src/wc.c)

------
saagarjha
Here is some more up-to-date source, though it still hasn't been updated since
2008:
[https://opensource.apple.com/source/text_cmds/text_cmds-99/w...](https://opensource.apple.com/source/text_cmds/text_cmds-99/wc/wc.c.auto.html)

------
Someone

      >>> foldMap countChar "one two\nthree"
      Counts {charCount = 13, wordCount = Flux NotSpace 3 NotSpace, lineCount = 1}
    

That _lineCount = 1_ surprised me, but testing on a Mac showed it to be
‘correct’. At least there, you can have a non-empty file that has zero lines.

~~~
boomlinde
In POSIX, a text file is a file consisting of zero or more lines. A line is a
sequence of non-null characters, shorter than a system-defined maximum and
terminated by a line feed character.

So a file that completely matches the string "one two\nthree" is by POSIX
standards not a text file. Of course, wc doesn't necessarily operate on text
files so you could interpret the result as a "complete POSIX line count".

------
axilmar
Program running with multiple threads beats program running without multiple
threads. News at 11.

------
sergeyz_prg
The author compares a multi-core application's performance to a single-core
one. Is that fair?

------
olliej
Duplicate of
[https://news.ycombinator.com/item?id=21261004](https://news.ycombinator.com/item?id=21261004)

------
verisimilitudes
This is interesting and also unsurprising. Haskell is an abstract, high-
level language and so doesn't dictate irrelevant machine details the way C
does. The only reason C is considered efficient at all is what was
effectively a meme from forty years ago; lifetimes of human effort have been
wasted making C fast, only for it to be trivially beaten by someone using
better tools. A proper analogy would be beating the world's fastest runner
in a mile-long dash by using a bicycle, or a cart being more efficient than
carrying items on one's head; the main difference is that people haven't
been deluded into thinking humans are faster or better at carrying cargo
than a creation of mankind.

