
Beating C with 70 lines of Go - todotask
https://ajeetdsouza.github.io/blog/posts/beating-c-with-70-lines-of-go/
======
jstanley
Alternatively, here's my entry for "Beating C with 40 lines of C":
[https://pastebin.com/JzFfE5GB](https://pastebin.com/JzFfE5GB)

    
    
      $ time wc -w 100m
      2266395 100m
      
      real 0m4.568s
    
      $ cc -o wc wc.c
      
      $ time ./wc 100m
      2343390 100m
      
      real 0m0.511s
    

Of course, it disagrees on the answer, because I just used 100M of random data
and it doesn't care about wide characters. It gives the same answer as GNU wc
on plain ASCII text.

It's not faster because it's better, it's faster because it is doing less.
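
The core of it is just a byte-at-a-time state machine (sketched here in Go
for comparison with the article - an illustration of the approach, not the
actual pastebin code):

        package main

        import (
            "fmt"
            "io"
            "os"
        )

        func main() {
            if len(os.Args) < 2 {
                fmt.Fprintln(os.Stderr, "usage: wc <file>")
                os.Exit(1)
            }
            f, err := os.Open(os.Args[1])
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
            }
            defer f.Close()

            buf := make([]byte, 64*1024)
            words, inWord := 0, false
            for {
                n, rerr := f.Read(buf)
                for _, b := range buf[:n] {
                    // ASCII whitespace only - wide characters are ignored,
                    // which is exactly the "doing less" part.
                    space := b == ' ' || b == '\n' || b == '\t' ||
                        b == '\r' || b == '\v' || b == '\f'
                    if !space && !inWord {
                        words++
                    }
                    inWord = !space
                }
                if rerr == io.EOF {
                    break
                }
                if rerr != nil {
                    fmt.Fprintln(os.Stderr, rerr)
                    os.Exit(1)
                }
            }
            fmt.Println(words, os.Args[1])
        }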

~~~
Quekid5
> It's not faster because it's better, it's faster because it is doing less.

I can't recall who said it, but this reminds me of the idea: if you don't
care about correctness, I can make it as fast as you like.

------
fao_
Something I posted elsewhere:

All of these articles are frustrating because they use different environments
and test sets and none of the ones I’ve read have posted the test sets up.
Some people use random characters, some people use existing files. Some people
use files of 1 MiB, some 100 MiB, some several GiB in size. Not only that, but
the people programming the replacements don’t even normalize for the
difference in machine/processor capability by compiling the competitors and
GNU wc from scratch. The system wc is likely to be compiled differently
depending on your machine. The multithreaded implementations are going to
perform differently depending on whether you're running Chrome when you test
the app, etc.

This would easily be solved by using the same distribution as a live USB,
sharing testing sets, and compiling things from scratch with predefined
options, but nobody seems to want to go to that much effort to get coherent
comparisons.

~~~
eMSF
>none of the ones I’ve read have posted the test sets up

I think the earliest(?) entry[1] in this "series" (the one done in Haskell)
shared its test input alongside the source code, and it has been referenced
in some of the others (often multiplied several times over, as the original
isn't very big). Beyond its being ASCII-only, the contents matter little.

>nobody seems to want to go to that much effort to get coherent comparisons

I think nobody really wants to get any coherent comparisons, because this
thing isn't really a competition between the entries themselves.

[1]: [https://github.com/ChrisPenner/wc](https://github.com/ChrisPenner/wc)
(see data/big.txt)

------
jimbob45
These articles seem like they're playing code golf more than proving anything
as a viable alternative to C.

The goal is to analyze your needs and pick the language best suited to your
task. The goal is _not_ to find one language that excels at everything.

~~~
zopppo
I could be wrong, but it seems like it wouldn't be difficult to outperform
these implementations with a similarly hacky C implementation. I expect that
blog post in the coming days.

~~~
barbegal
It's mentioned in the post:
[https://github.com/expr-fi/fastlwc/](https://github.com/expr-fi/fastlwc/)

~~~
zopppo
Ah, I didn't see that. Thanks!

------
tuckerpo
Yeah, let me just multithread and hand-optimize the piss out of this Go
program until it's marginally faster than it would have been in C.

Reminds me of the argument that venison tastes better than beef. The argument
roughly being, "If you shoot the deer right, drag it home right, gut it right,
skin it right, tenderize it right and cook it _just_ right, it'll be _almost_
as good as store-bought frozen beef"

~~~
ajeetdsouza
Author here. I think you should go through the article again - it's quite
readable, and there are no "hand-optimizations" as you say. Also, the
single-core implementation was already faster than the C version - the
multithreaded version was only done to explore different methods of
concurrency in Go.

Hope that clarifies things.

------
ksaj
Speedy CPUs with lots of cores are cheap enough that the time taken for
garbage collection may very well be worth it.

On the other hand, I do still love code golf and other speed/size/wonkiness
competitions. It's a lot like the early demo scene whose sole effort was to
show how much they could do with very limited resources.

------
devjam
Under the "Better parallelisation" section, you don't need to proxy methods
for your sync.Mutex if you embed the type.

    
    
        type FileReader struct { 
            File            *os.File
            LastCharIsSpace bool
            sync.Mutex
        }
    

And your Lock() and Unlock() calls on FileReader would just work.
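
For example, given the struct above, a (hypothetical) method can use the
promoted methods directly:

        // Lock and Unlock are promoted from the embedded sync.Mutex.
        func (fr *FileReader) setLastCharIsSpace(v bool) {
            fr.Lock()
            defer fr.Unlock()
            fr.LastCharIsSpace = v
        }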

~~~
ajeetdsouza
Fixed, thank you!

------
pleasecalllater
How many times have the tests been run? Once? First for C, then for Go? What
about disk caching then?

What I can see there is that, for the same algorithm, the Go version was not
as fast - it was comparable. The memory overhead could be caused by the size
of the program, as wc has more functionality.

Then a totally different algorithm is compared, with the false claim that
"this way a Go implementation is faster than a C one". Sure it is, as this is
a different implementation of a different algorithm. A fair comparison would
be implementing the same algorithm in C and comparing them. I assume the
difference wouldn't be huge.

So, generally, I think it's not a fair comparison.

~~~
ajeetdsouza
The tests were run 10 times, and I used the median value. There wasn't much
variance between runs, so I don't think disk caching played much of a role
here.

Being a garbage collected language with a runtime, Go certainly cannot match
the performance of C, and it was never my point to prove otherwise -
obviously, for the same algorithm, the C implementation would be faster.
Instead, I was exploring Go to highlight how simple it is to write safe,
concurrent code in it.

------
esmi
wc seems like a bad example because it's basically I/O bound. The title
should read, "Go's built-in bufio reader is faster than raw reads on the file
descriptor." Which it should be, because that's the point of bufio.
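
A minimal sketch of the difference (illustrative only, not the article's
code; assumes bufio and os are imported, error handling elided):

        f, _ := os.Open("big.txt")
        buf := make([]byte, 16)

        f.Read(buf) // raw read: every call is a syscall on the fd

        br := bufio.NewReaderSize(f, 64*1024)
        br.Read(buf) // buffered: one syscall refills 64 KiB, and small
                     // reads are then served from memory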

~~~
danieldk
No, it's not faster. They are comparing a Go version that does not do
character decoding (which is necessary for correctly counting the number of
words in the presence of non-ASCII punctuation) to a C version that does
decode characters (and matches them against a larger character set with
_iswspace_ ).

This can easily be shown by counting words and lines using _wc_ separately.
Word counting decodes characters (to find non-ASCII whitespace that may
separate words), whereas line counting just looks for ASCII line separators:

    
    
        $ time wc -w wiki-large.txt
        17794000 wiki-large.txt
        wc -w wiki-large.txt  0.48s user 0.02s system 99% cpu 0.496 total
        $ time wc -l wiki-large.txt
        854100 wiki-large.txt
        wc -l wiki-large.txt  0.02s user 0.01s system 99% cpu 0.034 total
    

So, without character decoding, looking at every byte is ~15 times faster. If
you compiled wc without multibyte character support (which would be a fair
comparison), it would probably beat Go without any parallelization.
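
In Go terms, the difference between the two paths looks roughly like this (a
sketch for concreteness - GNU wc does the decoding in C via mbrtowc/iswspace;
assumes the unicode and unicode/utf8 packages, and the function names are
mine):

        // Word counting: every rune must be decoded so that non-ASCII
        // whitespace (e.g. U+00A0) is seen as a word separator.
        func countWords(data []byte) int {
            words, inWord := 0, false
            for len(data) > 0 {
                r, size := utf8.DecodeRune(data)
                space := unicode.IsSpace(r)
                if !space && !inWord {
                    words++
                }
                inWord = !space
                data = data[size:]
            }
            return words
        }

        // Line counting: no decoding at all - '\n' is a single fixed byte
        // in UTF-8, so every byte can be tested directly.
        func countLines(data []byte) int {
            lines := 0
            for _, b := range data {
                if b == '\n' {
                    lines++
                }
            }
            return lines
        }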

~~~
ajeetdsouza
This is not true, see my comment here:
[https://news.ycombinator.com/edit?id=21587907](https://news.ycombinator.com/edit?id=21587907)

~~~
danieldk
Take the Darwin version linked from your site. Run _perf record wc
thefile.txt_. Then run _perf report_ and you will see _iswspace_ in the call
graph.

As I show in [1], removing this call and replacing it with a character match
gives a speedup of almost 2x.

[1]
[https://news.ycombinator.com/item?id=21592089](https://news.ycombinator.com/item?id=21592089)

------
big_chungus
> I hope it demonstrates that Go can be a viable alternative to C as a systems
> programming language.

Are you kidding me? This is nothing close to a systems programming language.
This isn't much of a comparison at all. wc is a very simple case that doesn't
match the complexity of real-world programs. Go comes close here because
you're not using the high-level abstractions that make it useful in the real
world. GNU coreutils also tend to focus on having tons of features (compared
to BSD/busybox/plan9/others), which can slow them down. If you really want to
get competitive, I bet an AVX-512 implementation would be fastest, and that's
more doable in C, but this is a bogus comparison in any case. It's just
people doing this because they like a specific language.

~~~
aikah
Furthermore, this isn't "systems programming"; he just replaced one "user
land" program with another. But I blame the Go team for turning "systems
programming" and "realtime programming" into useless buzzwords to promote
their language.

~~~
Gibbon1
Yeah, when I first looked at Go, that was the first FU for me. Go isn't a
systems or real-time programming language. It's a managed language with
training wheels. Which is okay by me. Lying about it, though, is not okay.

The second FU is that they claim decisions they made based on personal
preference were technical ones. That's a very insidious lie that programmers
make all the time. Insidious because it destroys trust between programmers
and managers.

~~~
pjmlp
This is not lying:

- gVisor hypervisor on Google Cloud and Linux sandbox on Chromebooks

- Android GPGPU debugger

- Fuchsia TCP/IP stack and volume management

- Baremetal TinyGo on Arduino Nano33 IoT, Adafruit Circuit Playground
Express, BBC micro:bit, among many others

- Coreboot firmware

- Biscuit, a POSIX-like OS

But whatever, the GC-FUD is strong among C devotees.

------
0xdead
> Go can be a viable alternative to C as a systems programming language

Stop this nonsense, please. This does not show anything even remotely close
to the systems programming capability of Go. Write a device driver in Go that
performs without lagging, benchmark it, and then come back.

~~~
grumpydba
[https://news.ycombinator.com/item?id=18399389](https://news.ycombinator.com/item?id=18399389).

It's been done. Performs well.

~~~
rowanG077
They quite literally were forced to use C for some parts because Go is not a
systems programming language. They write this in the article.

~~~
grumpydba
> As we wanted to keep unsafe code to a minimum, we instead chose to employ
> the C code from the original driver as an opportunity to present cgo.

It's not for performance reasons. I think you misread. Also, the driver is
pure Go now:

[https://github.com/ixy-languages/ixy.go](https://github.com/ixy-languages/ixy.go)
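
(For context, cgo in a nutshell - a minimal sketch, not the driver code: C
declarations go in a comment directly above the magic "C" import and become
callable from Go.)

        package main

        // #include <stdio.h>
        // static void hello(void) { puts("hello from C"); }
        import "C"

        func main() {
            C.hello() // the C function from the preamble, called via cgo
        }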

~~~
rowanG077
A systems language is not only about performance. Where did you get that?

~~~
grumpydba
They never said that they had to use c because go is not a systems language,
so your assertion looks wrong. They wanted to avoid using unsafe. In c
everything is unsafe by the way, so it makes it less of a systems language?

~~~
rowanG077
No, Go isn't a systems language because, for one, you don't have direct
control over memory when you need it. For instance, Go doesn't even have a
volatile keyword, which is essential in many cases when interfacing with
hardware. The paper you linked laments this as well.
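
For what it's worth, the closest standard Go gets is sync/atomic, which keeps
the compiler from eliding or reordering the accesses but is still not a
substitute for volatile MMIO (a sketch - the variable merely stands in for a
device register):

        package main

        import (
            "fmt"
            "sync/atomic"
        )

        // status stands in for a memory-mapped device register; real MMIO
        // would also need a fixed, uncached address, which plain Go cannot
        // express.
        var status uint32

        func main() {
            atomic.StoreUint32(&status, 1)
            fmt.Println(atomic.LoadUint32(&status))
        }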

------
bobowzki
Because they compare multi-threaded to single-threaded.

~~~
ajeetdsouza
Author here. If you read the article carefully, you'll see that the 70 lines
I used to outperform wc were a single-threaded implementation. I
multi-threaded it later just for overkill.

------
gigatexal
Looks like clean, idiomatic Go code to me.

------
timClicks
I'm really surprised by how much memory this task needs. Using a 16KB buffer
and incrementing an integer needs multiple MB.

I guess what's so large is the size of the executable after it's been loaded
into RAM?

~~~
weberc2
Yeah, if that figure includes the executable, then it's also probably
including the whole runtime (scheduler, GC, etc) since Go programs statically
link the runtime by default. In that case, 2MB isn't so bad (especially
considering glibc is ~10MB [and with no scheduler or GC!] iirc).

------
zhangxp1998
Okay, so you are comparing single-threaded GNU wc with your multi-threaded Go
implementation? That's fair.

------
danieldk
Another article in the series _beating C by moving the goal posts_. My comment
on the original article about the Haskell version on lobste.rs:

Keep in mind in all comparisons to GNU _wc_ that it does extra work,
detecting and decoding multi-byte characters if present, in order to
correctly count the number of words. _perf_ shows a significant amount of
time being spent in multibyte character handling. If you trigger a code path
that does not do decoding beyond the byte level, it's much faster:

    
    
        $ time wc wiki-large.txt 
        854100  17794000 105322200 wiki-large.txt
        wc wiki-large.txt  0.42s user 0.02s system 99% cpu 0.438 total
        $ time wc -l wiki-large.txt
        854100 wiki-large.txt
        wc -l wiki-large.txt  0.02s user 0.02s system 98% cpu 0.034 total
    

( _wc -l_ looks at every byte, but does no decoding.)

From a quick glance, this is also where the 'Haskell beats C' article fails.
It's comparing apples to oranges: the _ByteString_ implementation does not do
the same as GNU/macOS _wc_ and returns incorrect results in the presence of
non-ASCII punctuation. The article incorrectly states that _wc_ will handle
input as ASCII. Unless you use a single-byte locale, macOS _wc_ uses the
combination of _mbrtowc_ and _iswspace_.

~~~
ajeetdsouza
Author here. This is not true - I included a link to the manpage
([https://ss64.com/osx/wc.html](https://ss64.com/osx/wc.html)) in the article
to avoid this confusion. I did not use GNU wc; I used the OS X one, which, by
default, counts single byte characters. From the manpage:

> The default action is equivalent to specifying the -c, -l and -w options.

> -c The number of bytes in each input file is written to the standard output.

> -m The number of characters in each input file is written to the standard
> output. If the current locale does not support multi-byte characters, this
> is equivalent to the -c option.

Moreover, I also mentioned in the article that I was using us-ascii encoded
text, which means that even -m would have treated it as ASCII text.

Hope that clarifies your issue.

~~~
danieldk
It is not about the character count, but the word count. wc decodes characters
to find non-ASCII whitespace as word separators. If you read further in the
same man page:

 _White space characters are the set of characters for which the iswspace(3)
function returns true._

That your text is ASCII encoded does not matter, since ASCII is a subset of
UTF-8. So at the very least, you need an extra branch to check that a byte's
value is smaller than 128 (since any byte that does not start with a zero bit
is part of a multi-byte character in UTF-8).
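
In Go terms, that extra branch looks something like this (a sketch; it
assumes the unicode and unicode/utf8 packages, and the helper name is mine):

        // isSpaceAt reports whether the character starting at data[i] is
        // whitespace, and how many bytes it occupies. utf8.RuneSelf == 0x80.
        func isSpaceAt(data []byte, i int) (space bool, size int) {
            if b := data[i]; b < utf8.RuneSelf {
                // High bit clear: a single-byte (ASCII) character.
                return b == ' ' || b == '\t' || b == '\n' ||
                    b == '\r' || b == '\v' || b == '\f', 1
            }
            // High bit set: part of a multi-byte sequence, so decode it.
            r, size := utf8.DecodeRune(data[i:])
            return unicode.IsSpace(r), size
        }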

However, if you look at the implementation at

[https://opensource.apple.com/source/text_cmds/text_cmds-68/w...](https://opensource.apple.com/source/text_cmds/text_cmds-68/wc/wc.c.auto.html)

you can see that this code path actually uses _mbrtowc_ , so there is also
the function call overhead.

~~~
eMSF
It only calls _mbrtowc_ if _domulti_ is set (and MB_CUR_MAX > 1), i.e. only
when given the option -m.

~~~
danieldk
You are right! So that's a Darwin oddity. Still, a wide-char function,
iswspace, is called in that code path, which adds function call overhead in a
tight loop.

~~~
tom_mellior
If domulti is not set, the wide char function is _not_ called as far as I can
tell. Why would it? It's explicitly meant not to do wide char stuff in that
case.

FWIW, when this was going around for the first time, I took this Darwin
version of wc and experimented with setting domulti to const 0, statically
removing all paths where it might do wide character stuff. I didn't measure
any performance difference compared to just running it unmodified.

~~~
danieldk
It's about _iswspace_ as I mentioned in the parent comment. Replace the line

    
    
        if (iswspace(wch))
    

by

    
    
        if (wch == L' ' || wch == L'\n' || wch == L'\t' || wch == L'\v' || wch == L'\f')
    

And I get a ~1.7x speedup:

    
    
        $ time ./wc ../wiki-large.txt
          854100 17794000 105322200 ../wiki-large.txt
        ./wc ../wiki-large.txt  0.47s user 0.02s system 99% cpu 0.490 total
        $ time ./wc2 ../wiki-large.txt
          854100 17794000 105322200 ../wiki-large.txt
        ./wc2 ../wiki-large.txt  0.28s user 0.01s system 99% cpu 0.293 total
    

Next, remove the unnecessary branching introduced by multi-byte character
handling [1]. This actually resembles the Go code pretty closely. We get a
speedup of 1.8x:

    
    
        $ time ./wc3 ../wiki-large.txt
          854100 17794000 105322200 ../wiki-large.txt
        ./wc3 ../wiki-large.txt  0.25s user 0.01s system 99% cpu 0.267 total
    

If we take the second table from the article and divide the C result (5.56) by
1.8, the C performance would be ~3.09, which is faster than the Go version
(3.72).

Edit: for comparison, the Go version from the article:

    
    
        $ time ./wcgo ../wiki-large.txt
          854100 17794000 105322200 ../wiki-large.txt
        ./wcgo ../wiki-large.txt  0.32s user 0.02s system 100% cpu 0.333 total
    

So, when removing the multi-byte character whitespace handling, the C version
is indeed faster than the (non-parallelized) Go version.

[1]
[https://gist.github.com/danieldk/f8cdaed4ba255fb2954ded50dd2...](https://gist.github.com/danieldk/f8cdaed4ba255fb2954ded50dd2931ed)

~~~
tom_mellior
Thanks, I finally understood what you are saying. Indeed, the code uses
iswspace to test all characters, wide or normal. Strange design choice. For
whatever it's worth, even just changing

    
    
        if (iswspace(wch))
    

to something like

    
    
        if (domulti && iswspace(wch))
            ...
        else if (!domulti && isspace(wch))
            ...
    

got something like a 10% speedup on my machine. And replacing isspace with an
explicit condition like yours is _much_ faster still. I checked: isspace is
macro-expanded to a table lookup and a mask, but apparently that's still
slower than your explicit check. I'm a bit surprised by this but won't
investigate further at the moment.
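
(In Go terms, the two variants being compared are roughly these - names are
mine:)

        // Table lookup, analogous to what the isspace macro expands to.
        var spaceTab = [256]bool{
            ' ': true, '\t': true, '\n': true,
            '\v': true, '\f': true, '\r': true,
        }

        // Explicit comparison chain.
        func isSpaceCmp(b byte) bool {
            return b == ' ' || b == '\n' || b == '\t' ||
                b == '\v' || b == '\f' || b == '\r'
        }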

~~~
danieldk
_Thanks, I finally understood what you are saying._

I am sorry for the unclear comments. I'll stop commenting on a phone ;).

 _Indeed, the code uses iswspace to test all characters, wide or normal.
Strange design choice._

I agree, it's really strange. This seems to be inherited from the FreeBSD
version, which still does this as well:

[https://github.com/freebsd/freebsd/blob/8f9d69492c3da3a8c1ea...](https://github.com/freebsd/freebsd/blob/8f9d69492c3da3a8c1ea7fa1bc82b7639cc3064b/usr.bin/wc/wc.c#L310)

It has the worst of both worlds: it incorrectly counts the number of words
when there is non-ASCII whitespace (since _mbrtowc_ is not used), but it pays
the penalty of using _iswspace_. It also doesn't conform to POSIX, which
states:

 _The wc utility shall consider a word to be a non-zero-length string of
characters delimited by white space.

[...]

LC_CTYPE

Determine the locale for the interpretation of sequences of bytes of text data
as characters (for example, single-byte as opposed to multi-byte characters in
arguments and input files) and which characters are defined as white space
characters._

------
z3t4
This is excellent for showing how to do X in Y language. But like most
benchmarks, it's a bit silly. If speed were the priority, you could parse
only 1% of the file, then multiply the values by 100, assuming the rest of
the file looks alike. Do you really care to know that the file has 1337 words
rather than ca. 1300 words?
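
Something like this, say (a throwaway sketch of that estimate; the 1% sample
and the ASCII-only whitespace test are arbitrary, and fmt/io/log/os are
assumed imported):

        f, err := os.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        st, err := f.Stat()
        if err != nil {
            log.Fatal(err)
        }

        sample := make([]byte, st.Size()/100) // first 1% of the file
        n, _ := io.ReadFull(f, sample)

        words, inWord := 0, false
        for _, b := range sample[:n] {
            space := b == ' ' || b == '\n' || b == '\t'
            if !space && !inWord {
                words++
            }
            inWord = !space
        }
        fmt.Printf("~%d words (estimated)\n", words*100)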

What about power consumption?

