
Parallel Programming in Multicore OCaml - pjmlp
https://github.com/prismlab/parallel-programming-in-multicore-ocaml/tree/draft
======
sadiq
(Multicore OCaml hacker here - speaking for myself though)

Nice to see people having a look at this. It's an early draft chapter that
Sudha, one of the Multicore team, has been working on.

If you have feedback or suggestions I'm sure she'd welcome it on the draft's
PR: [https://github.com/prismlab/parallel-programming-in-
multicor...](https://github.com/prismlab/parallel-programming-in-multicore-
ocaml/pull/1)

If you're new to Multicore OCaml the repo is at [https://github.com/ocaml-
multicore/ocaml-multicore](https://github.com/ocaml-multicore/ocaml-multicore)
. It has installation instructions near the bottom (easiest way is via OPAM).

The latest June update of our progress is available at
[https://discuss.ocaml.org/t/multicore-ocaml-
june-2020/6047](https://discuss.ocaml.org/t/multicore-ocaml-june-2020/6047)

For a technical deep dive on how Multicore OCaml works our recent ICFP paper
has a lot of detail:
[https://arxiv.org/abs/2004.11663](https://arxiv.org/abs/2004.11663)

------
Athas
Some months ago, I used my lockdown time to make a repository containing
implementations of a (bad) ray tracer in various parallel functional
programming languages[0]. I was quite happy that someone contributed a
multicore OCaml implementation[1]. I think the most interesting part was the
series of pull requests where an expert (I think even a multicore OCaml
contributor) significantly improved the performance of the ray tracer [2,3,4].
I think being able to see the individual steps is valuable, because functional
programming has long promised "effortless parallelism", and hence it is
enlightening to see how different well-optimised parallel functional code is
from an original more or less idiomatic sequential implementation. In OCaml's
case, I don't think the difference is all that great. Most of the improvements
came from using a library that provides a proper task pool, since "multicore
OCaml" on its own seems to just provide relatively low-level (but very well
implemented) primitives. It's a little bothersome that the programmer has to
manually provide a chunk size, rather than letting the runtime figure such
things out on its down, but it's not terrible. It was far more annoying that
multicore OCaml is (or was) missing functions for measuring wall clock time...

Although there was a later PR that improved performance 40% just by moving
code around to minimise allocations... [5] Unfortunate that such minor changes
can have such dramatic impact.

[0]:
[https://github.com/athas/raytracers](https://github.com/athas/raytracers)
[1]:
[https://github.com/athas/raytracers/tree/master/ocaml](https://github.com/athas/raytracers/tree/master/ocaml)
[2]:
[https://github.com/athas/raytracers/pull/4](https://github.com/athas/raytracers/pull/4)
[3]:
[https://github.com/athas/raytracers/pull/5](https://github.com/athas/raytracers/pull/5)
[4]:
[https://github.com/athas/raytracers/pull/6](https://github.com/athas/raytracers/pull/6)
[5]:
[https://github.com/athas/raytracers/pull/7](https://github.com/athas/raytracers/pull/7)

------
xvilka
Yes, there was also a recent report [1] about the current state of Multicore.
Apart from some unsolved problems the two largest problem now are 1) it's not
in the mainline yet 2) lack of support in standard/third-party libraries.
Sadly, it might take quite a while for ecosystem to catch up. Haskell has a
similar problem with Linear Types[2][3].

[1] [https://discuss.ocaml.org/t/multicore-ocaml-
june-2020/6047](https://discuss.ocaml.org/t/multicore-ocaml-june-2020/6047)

[2] [https://gitlab.haskell.org/ghc/ghc/-/wikis/linear-
types](https://gitlab.haskell.org/ghc/ghc/-/wikis/linear-types)

[3] [https://github.com/tweag/linear-base/](https://github.com/tweag/linear-
base/)

------
wk_end
> The Multicore OCaml compiler comes with two variants of Garbage Collector,
> namely a concurrent minor collector (ConcMinor) and a stop-the-world
> parallel minor collector (ParMinor). Our experiments have shown us that
> ParMinor performs better than ConcMinor in majority of the cases. ParMinor
> also does not need any changes in the C API of the compiler, unlike
> ConcMinor which breaks the C API. So, the consensus is to go forward with
> ParMinor during up- streaming of the Domains-only Multicore.

Is there any use for ConcMinor at this point or is it at a dead-end?

~~~
logicchains
>ParMinor performs better than ConcMinor in majority of the cases

Does this just mean better throughput, or also better latency?

~~~
gadmm
Regarding latency, beyond the kind of benchmarks proposed in the paper (from
which this performance claim comes), keep in mind that the behaviour is
qualitatively different between the two alternatives.

Systems programmers in OCaml can apply techniques to achieve a low latency,
but these are unlikely to scale as well under ParMinor. Indeed, with the
latter, it is not going to be possible to have a low-latency domain and a non-
low-latency domain coexist: the whole program has to be written in the low-
latency style.

This is clear from the qualification “stop-the-world” for ParMinor, but worth
to note for people who would read the claim out of context. There can be
further qualitative differences in the GC design that are not measured by the
benchmarks.

~~~
kcsrk
> but these are unlikely to scale as well under ParMinor

To be clear, the said systems programs are not going to see slowdown when run
in a sequential setting (which is what I expect the majority of use cases will
be). It is unlikely that these systems programs will immediately want to take
advantage of parallelism. Moreover, Multicore OCaml aims to add support for
shared-memory parallel programming to increase throughput of the program.
Getting high throughput and maintaining very low latency is a big challenge
generally, and can't just be solved by the runtime system. It needs a
different way of writing programs altogether.

> Indeed, with the latter, it is not going to be possible to have a low-
> latency domain and a non-low-latency domain coexist: the whole program has
> to be written in the low-latency style.

This is wrong. If the low-latency code is written as it is (no allocations),
then in ParMinor the low-latency domain will be responsive at the cost of
increasing the latency on non-low-latency domain. Of course, no GC safe points
should be inserted in those low-latency code. This design is not ossified.

Another details to remember is that ParMinor is a stop-the-world parallel
minor collector. The major collection is concurrent mark-and-sweep which keeps
the latency low. I'd recommend checking out the new concurrent GC for GHC
which uses concurrent collection for the major heap, and parallel collection
for the minor heap [1], which is similar to ParMinor.

Ultimately, Multicore OCaml aims to offer an easy way of helping 95% of the
programs to take advantage of parallel execution for increasing throughput
without compromising latency. It is quite hard to fully support different
expert cases especially since they rely on the details of the existing runtime
system, which may no longer hold when the runtime system changes.

[1]
[https://dl.acm.org/doi/pdf/10.1145/3381898.3397214](https://dl.acm.org/doi/pdf/10.1145/3381898.3397214)

~~~
gadmm
Indeed, OCaml multicore preserves the sequential low-latency setting, and this
is important. Generalising this style is indeed a challenge, and I believe
that there is something to learn from those expert cases.

I do not find realistic the suggestion that low-latency code could simply
avoid safe points. To get a low latency, one in general avoids promoting
values, but one still allocates on the minor heap. One can achieve much less
without allocating at all, and moreover OCaml multicore adds extra safe points
for a reason, even if they could be turned off.

In the end, what you write about details of the runtime system illustrates my
point: the two minor collectors have qualitative differences regarding
latency, that affect available programming styles, and this is not measured by
the benchmarks.

------
badfrog
How is Domain different/better than Deferred?
[http://dev.realworldocaml.org/concurrent-
programming.html](http://dev.realworldocaml.org/concurrent-programming.html)

~~~
sadiq
Domain is a unit of parallelism in Multicore OCaml - it's effectively a
heavyweight thread. The intention is that you tend to have as many Domains as
you have cores you want to use in the computation.

Deferred relates to concurrency rather than parallelism, that you can have
multiple overlapping computations. You can have concurrency without
parallelism. OCaml has ways you can have parallel computations going on that
don't hold the interpreter lock and I think some of the concurrent libraries
can utilise this.

Multicore OCaml splits parallelism and concurrency, the former is via Domains
and the latter is with Fibers. The paper I linked in my other comment on this
thread touches on this a little but kc also has a short write-up of how you
can use effects to write a scheduler for multicore's fibers:
[https://kcsrk.info/ocaml/multicore/2015/05/20/effects-
multic...](https://kcsrk.info/ocaml/multicore/2015/05/20/effects-multicore/)

------
sigrlami
Is C Api changes stable? Last time I checked there was some breaking changes

~~~
kcsrk
No breaking changes with the parallel minor collector (ParMinor). This is the
design which will be upstreamed. See
[https://arxiv.org/pdf/2004.11663.pdf](https://arxiv.org/pdf/2004.11663.pdf)
for details.

~~~
sigrlami
Thanks! Looks very promising

