
Beating C with Futhark Running on GPU - Athas
https://futhark-lang.org/blog/2019-10-25-beating-c-with-futhark-on-gpu.html
======
raphlinus
Monoid homomorphisms for the win. I discussed a very similar approach for
computing the length of the longest line in a "rope science" article[1], as
well as unquoting strings[2]. In this case, it was very nice to see actual
code, and Futhark looks like a good language. For the string unescaping, I
used Nvidia's Thrust, which is a templated C++ library; the code is generally
similar to the Futhark version, with similar results.

[1]: [https://xi-editor.io/docs/rope_science_01.html](https://xi-editor.io/docs/rope_science_01.html)

[2]: [https://raphlinus.github.io/personal/2018/04/25/gpu-unescaping.html](https://raphlinus.github.io/personal/2018/04/25/gpu-unescaping.html)
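
The monoid-homomorphism idea above is easy to make concrete. Here is a hedged sketch in C (the article itself uses Futhark; all names below are made up for illustration): each character maps to a small summary value, and an associative combine merges adjacent summaries, which is exactly the shape a parallel reduction needs.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Summary of a chunk of text: just enough information to combine
   adjacent chunks associatively. */
typedef struct {
    bool empty;     /* identity element of the monoid */
    bool left_ns;   /* chunk starts mid-word (first char is non-space) */
    bool right_ns;  /* chunk ends mid-word (last char is non-space) */
    long words;     /* number of maximal non-space runs in the chunk */
} WcState;

/* Map a single character into the monoid. */
static WcState wc_char(char c) {
    bool ns = !isspace((unsigned char)c);
    return (WcState){ .empty = false, .left_ns = ns, .right_ns = ns,
                      .words = ns ? 1 : 0 };
}

/* Associative combine: two word runs that touch at the boundary
   merge into one, so subtract the double count. */
static WcState wc_combine(WcState a, WcState b) {
    if (a.empty) return b;
    if (b.empty) return a;
    return (WcState){
        .empty = false,
        .left_ns = a.left_ns,
        .right_ns = b.right_ns,
        .words = a.words + b.words - (a.right_ns && b.left_ns ? 1 : 0),
    };
}

/* Sequential fold; on a GPU the same combine would drive a parallel
   reduction over chunk summaries. */
static long wc_words(const char *s, size_t n) {
    WcState acc = { .empty = true };
    for (size_t i = 0; i < n; i++)
        acc = wc_combine(acc, wc_char(s[i]));
    return acc.words;
}
```

Because `wc_combine` is associative with `empty` as identity, the fold can be split at any chunk boundary and recombined in any grouping, which is what makes it GPU-friendly.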

------
Herrin
I found this more readable and understandable than the Haskell post, although
I can't quite say why. It might simply be the repetition.

I'm really interested in Futhark, though I haven't found a project where it
would make sense to use it. But I feel it has the potential to make GPU
programming feel less overwhelming, the same way Elm did with frontend work
for me.

~~~
mjaniczek
Since Futhark has sum types, I wonder whether we could transpile Elm syntax to
Futhark. I'll have to dig into what's possible with Futhark and how well it
would map...

~~~
earth_walker
James Carson gave a talk at this year's Elm Conf on using Elm to talk to
Futhark:

[https://www.youtube.com/watch?v=FVP8zxpZKV8](https://www.youtube.com/watch?v=FVP8zxpZKV8)

------
e12e
> Word counting is primarily IO-bound, and it is much too expensive to ferry
> the file contents all the way to the GPU over the (relatively) slow PCI
> Express bus just to do a relatively meagre amount of computation.

After seeing that it's possible to play Crysis using software rendering on an
AMD Rome CPU with 128 hardware threads [1], might this lead to some
vindication for AMD sticking with OpenCL (assuming such a CPU can be exposed
via OpenCL)? Or is it simpler to ignore that (in general, and for Futhark) and
just use regular threads to parallelize across many CPU cores?

[1]
[https://news.ycombinator.com/item?id=21339652](https://news.ycombinator.com/item?id=21339652)

~~~
Mathnerd314
CUDA is proprietary to NVidia, and is pretty much the standard for GPU
computing. AMD's been chipping away with OpenCL, Vulkan/GLSL,
[https://github.com/RadeonOpenCompute/hcc/wiki](https://github.com/RadeonOpenCompute/hcc/wiki),
etc. but not much luck so far. I wouldn't say AMD's been "sticking with"
OpenCL, if anything it seems like they will deprecate it in a few years, as
the plan is to fold OpenCL into Vulkan.

I guess it is possible to use OpenCL on the CPU as well, but it seems to be
intended mostly for testing purposes. The Crysis software renderer uses
threads:
[https://github.com/google/swiftshader/blob/master/src/Common...](https://github.com/google/swiftshader/blob/master/src/Common/Thread.hpp)

~~~
rrss
Last I checked, AMD implemented the cuda apis as "hip."

------
tom_mellior
I'm starting to sound like a broken record on this, but if you're going to
compare to your system wc without trying to figure out if it was compiled with
-O3 [EDIT: and what source code it was compiled from], you haven't shown
anything in the sequential case.

What this article does show is that Futhark really does allow one to express
this in a much simpler way than Haskell.

~~~
Athas
That's a good point, but the -O3 doesn't actually do a whole lot here. I
recompiled the Futhark-generated C code with just -O and performance was
unchanged. If you look at the generated C code, there isn't really a lot to do
either:
[https://gist.github.com/athas/7c8ffc2620a9406e4bbb0df89f2fc9...](https://gist.github.com/athas/7c8ffc2620a9406e4bbb0df89f2fc9f6)

I hope I can assume that RHEL compiles their wc with at least -O.

~~~
tom_mellior
> That's a good point, but the -O3 doesn't actually do a whole lot here.

True, maybe it's not about -O3 but about some factor in the unknown source
code of the system wc. I did compile one version of wc with -O3 and it beat my
system wc (Ubuntu) by 2x:
[https://news.ycombinator.com/item?id=21271951](https://news.ycombinator.com/item?id=21271951)

~~~
Athas
Honestly, I would expect the main reason my wc is faster is that mmap()ing the
file and then reading it in a huge chunk is about as fast as the kernel's IO
can go. GNU wc cannot do this in general because it's supposed to work on
pipes as well, and I doubt anyone cared enough about the tiny performance
difference to exploit the case where the input file is mmap()able.

(I had actually hoped Futhark would be slower sequentially, just so this
wouldn't be the focus of the discussion!)
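
The mmap-and-scan approach described above can be sketched in a few lines of C (an illustrative sketch, not the article's actual code; error handling is abbreviated, and `count_newlines` is a made-up name):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a regular file and count newlines in one pass over the mapping.
   This only works for mmap()able inputs, not pipes -- which is exactly
   the generality GNU wc has to preserve. Returns -1 on error. */
static long count_newlines(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    long lines = 0;
    if (st.st_size > 0) {
        const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }
        for (off_t i = 0; i < st.st_size; i++)
            lines += (p[i] == '\n');
        munmap((void *)p, st.st_size);
    }
    close(fd);
    return lines;
}
```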

~~~
tom_mellior
> GNU wc cannot do this in general because it's supposed to work on pipes as
> well

I used the "reference" source code linked from the original Haskell post, a
BSD version hosted by Apple:
[https://opensource.apple.com/source/text_cmds/text_cmds-68/w...](https://opensource.apple.com/source/text_cmds/text_cmds-68/wc/wc.c.auto.html)

It uses raw read() from a file descriptor and works with pipes as well. I
think the only special handling for stdin vs. an actual file it has is calling
fstat() if _only_ the number of characters is requested, which shouldn't apply
here.

So yes, this version does need to do more complicated I/O than a simple
mmap(). And (broken record, but I'll stop after this) it's 2x as fast as my
system's GNU wc (when compiled with -O3 vs. however the system wc was
compiled).

> I had actually hoped Futhark would be slower sequentially

It might still turn out to be, if you see if you can get a faster C version of
wc.

~~~
YSFEJ4SWJUVU6
> It might still turn out to be, if you see if you can get a faster C version
> of wc.

You definitely can, at least if you allow manually vectorized code.

On my system, with a 1.661GB file (256 times big.txt from the original Haskell
post) GNU wc takes about 6.5s (real time), a stripped down version of Apple's
implementation about 4.1s, and a single-threaded vectorized wc (written in C)
only 0.27s. (These times are of course only with a hot cache. For reference,
catting the same file to /dev/null takes about 0.18s.)

edit: corrected the time for the BSD-derived implementation
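
The linked code below is the authoritative version; as a rough illustration of how such manual vectorization typically works, here is an SSE2 sketch (not the linked code; `__builtin_popcount` assumes GCC/clang, and for brevity only ' ' and '\n' count as whitespace):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Count words 16 bytes at a time: build a bitmask of non-whitespace
   bytes, then count the positions where a non-whitespace byte follows
   a whitespace one (i.e. word starts). */
static long wc_simd(const char *s, size_t n) {
    const __m128i sp = _mm_set1_epi8(' ');
    const __m128i nl = _mm_set1_epi8('\n');
    long words = 0;
    unsigned prev_nonws = 0;  /* was the previous byte non-whitespace? */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(s + i));
        __m128i ws = _mm_or_si128(_mm_cmpeq_epi8(v, sp),
                                  _mm_cmpeq_epi8(v, nl));
        unsigned nonws = ~(unsigned)_mm_movemask_epi8(ws) & 0xFFFFu;
        /* bit j of pred = was byte j's predecessor non-whitespace? */
        unsigned pred = (nonws << 1) | prev_nonws;
        words += __builtin_popcount(nonws & ~pred);  /* word starts */
        prev_nonws = (nonws >> 15) & 1;
    }
    for (; i < n; i++) {  /* scalar tail for the last < 16 bytes */
        unsigned nonws = (s[i] != ' ' && s[i] != '\n');
        words += nonws && !prev_nonws;
        prev_nonws = nonws;
    }
    return words;
}
```

Note this is the same word-start counting trick as the scalar version, just applied to 16 bytes of comparison results at once via `_mm_movemask_epi8`.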

~~~
ummaycoc
Would you mind sharing the 0.27s version?

~~~
YSFEJ4SWJUVU6
See [https://git.io/JeEjB](https://git.io/JeEjB)

------
yiyus
Very honest discussion of the results. I liked it.

Would it be possible to use Futhark to rewrite the APL implementation instead
of the Haskell one? That would make an interesting comparison.

~~~
Athas
> Would it be possible to use Futhark to rewrite the APL implementation
> instead of the Haskell one? That would make an interesting comparison.

Sadly, from what I can see, the APL version makes use of so-called nested
arrays in the 'words' function, specifically arrays of strings (this is
different from multidimensional arrays). Futhark does not directly support
nested arrays. A rewrite of the APL implementation would require using a quite
different algorithm (or a nontrivial encoding).

But my APL is a bit rusty, so I may be wrong.

~~~
ummaycoc
In my original version I didn't use a nested array of strings, but
[Olzd](https://news.ycombinator.com/user?id=olzd) pointed out a way you could
do that, and I added it as a theoretical first attempt (theoretical since it
wasn't my first attempt, but would have been had I been better at APL).

My first attempt did some stuff with subtracting items in an array from their
neighbors. Now, with info I got from
[mlochbaum](https://news.ycombinator.com/user?id=mlochbaum), I have another
version that uses windowed reductions.

So those are three versions; after that I split it up further just to see
where that leads, and that actually ends up feeling a lot like the Haskell /
Futhark solution to me.

------
ngcc_hk
No time; I'm working on my I Ching course for my master's degree. However, I
joined a programming competition many years ago, about an incident involving a
crazy assignment for a kid. My clone of the GitHub repo is here:

[https://github.com/kwccoin/ABCDEFGHPPP](https://github.com/kwccoin/ABCDEFGHPPP)

It is fun. I think I even tried to use a micro version of COBOL to do it. But
the fastest is still C.

Maybe we can start one. Sadly, no time to join.

(I think there is a website that posts many versions of the same program. It
must have wc. If not, these should be there.)

------
hellofunk
A lot of the Futhark demos you see are rather basic algorithms like matrix
multiplication, and the documentation for Futhark does say that it is not
well-suited to complex kernels, which to me puts a big limit on how useful it
could be to invest in it.

I really like technologies like this and SYCL, which aim to greatly simplify
the process of writing GPU code. The important thing is that such a tool can
handle whatever you'd throw at it as if you were writing directly in Metal,
CUDA, or OpenCL, and I don't think that is the case (yet?) with Futhark.

~~~
Athas
It depends on what you consider a "complex kernel". Futhark is only for
regular non-recursive data parallelism, but I'll argue that something like a
genetic algorithm that does calibration of market parameters in the Heston
model[0] is pretty complex. It comprises multiple levels of parallelism and
several kernels (last I checked, the core work is done in four kernels which
are invoked in a loop).

But more importantly, this benchmark is written as a composition of two
reusable parts (a genetic algorithm that is parametric in its objective
function, and a specific objective function that does option pricing) that are
then put together in an efficient and automatic way by the compiler. You
literally _could not_ write it this way in OpenCL or CUDA (modulo extreme
amounts of template metaprogramming in the latter). While you could certainly
write a specialised GPU program that did exactly this calibration, and
probably outperform Futhark, you would not be able to structure it as reusable
components without significant performance loss. This, I think, is the main
advantage of using a high-level language together with an optimising compiler.

[0]: [https://github.com/diku-dk/futhark-benchmarks/tree/master/misc/heston](https://github.com/diku-dk/futhark-benchmarks/tree/master/misc/heston)

~~~
hellofunk
Thank you for this thoughtful reply, really appreciate it.

------
sword_smith
This sounds smart. I haven't programmed in Futhark yet, but I really enjoy
functional programming. Where is this language primarily used: at
universities, or also in industry?

~~~
olodus
I got to use it at university, in a course on parallel functional programming.
I think it is still mostly a research language, but if you need to do
computation on the GPU and like functional programming, it would probably be
an interesting alternative to try out. It really does a great job of
optimizing and parallelizing your program.

~~~
sword_smith
Which university? :)

~~~
olodus
Chalmers University of Technology, in Gothenburg. Though Futhark is developed
in Copenhagen, I think. We had a guest lecture from one of the developers of
the language; it was really interesting, together with some exercises for us
to play with the language. Though it has gotten even better since then, from
what I've seen on their blog.

~~~
sword_smith
I wonder if this language has spread beyond Europe yet. Do any Americans use
it?

------
jacquesm
Beating CPU with GPU would have been a better title.

~~~
Svip
I think the title is a reference to the two previous articles that inspired
it. Like the old 'X considered harmful' format.

------
jsd1982
One could easily amortize the startup cost by putting the backend logic into a
background service where it only starts up once and continues running. A
frontend program like `wc` would just forward requests to the backend service.

------
tempodox
A nice, unobtrusive way of showing monads at work.

~~~
tathougies
These are monoids, not monads.

------
makz
GPU supremacy

------
_wldu
C is the Mike Tyson of programming languages. There will never be another like
it. It's simple, dangerous and fast. You can't beat C, but everyone will keep
trying. It may beat itself in the end though as it's too rough for the modern
world.

~~~
OskarS
I mean, Futhark certainly can. The whole point of Futhark is that it's a
functional language that can run on GPUs. Futhark will beat the pants off of C
for most any problem that is suitable for GPU computation, even if written in
an entirely functional style.

For CPUs, I'd like to introduce you to my good friend Fortran.

~~~
gridlockd
That's an apples to oranges comparison, because Futhark needs to compete with
C on the GPU (CUDA/OpenCL), not C on the CPU.

That's what I'm missing from these benchmarks - how does it fare against a
handwritten, competent implementation in those languages?

~~~
OskarS
In the linked post, the author tests Futhark running on the CPU without any
parallelism or low-level optimizations, and it still beat GNU wc.

Fair enough about apples to oranges, but it was really distasteful to me that
the top comment on this post was about how C is “unbeatable”, when the article
clearly showed that Futhark was faster. And since we’re at the dawn of a new
age of parallel computing, statements like that are absurd in general.

~~~
mlyle
> In the linked post, the author tests Futhark running on the CPU without any
> parallelism or low-level optimizations, and it still beat GNU wc.

Barely...

It also mmaps in a whole 100 meg file. :P wc is not optimized to count as fast
as possible, resource use be damned.

