
OCaml 4.03 will, “if all goes well”, support multicore - wting
https://sympa.inria.fr/sympa/arc/caml-list/2015-05/msg00034.html
======
avsm
If you'd like to see the approach we're taking at OCaml Labs in order to build
multicore, read KC's blog post here:

[http://kcsrk.info/ocaml/multicore/2015/05/20/effects-
multico...](http://kcsrk.info/ocaml/multicore/2015/05/20/effects-multicore/)

The core idea is incredibly exciting (to us, anyway). Rather than baking a
specific multicore scheduler into the runtime, we're allowing pluggable
schedulers written in OCaml itself: algebraic effects let an independently
written scheduler compose concurrency across OCaml threads. This keeps the
OCaml runtime lean, and even allows applications to define their own
strategies for concurrent scheduling.

------
DonPellegrino
More information is available

in my original post in r/ocaml:
[https://www.reddit.com/r/ocaml/comments/36ninh/403_scheduled...](https://www.reddit.com/r/ocaml/comments/36ninh/403_scheduled_for_the_end_of_the_year_if_all_goes/)

in the repost in r/programming:
[https://www.reddit.com/r/programming/comments/36ppx0/ocaml_4...](https://www.reddit.com/r/programming/comments/36ppx0/ocaml_403_will_if_all_goes_well_support_multicore/)

~~~
DonPellegrino
For those asking "How the hell does OCaml not support multicore in 2015????",
this is my reply, crossposted from /r/ocaml:

You can make OS-level threads, but no two of them can run OCaml code at the
same time due to the runtime lock (commonly called a GIL, for Global
Interpreter Lock). Then why are they even there, you might ask? Because they
let you make a blocking call on one thread while the main thread keeps
executing other stuff. Other languages with the same restriction are
JavaScript (including Node.js, which is single-threaded by design), Ruby (MRI)
and Python (CPython).

Now, IN PRACTICE, things are a bit different. You're never gonna create your
own thread just to block on something. You're gonna use Lwt to manage all your
concurrency, so you can run tons of blocking operations at the same time and
combine the tasks nicely without ending up in Node.js-style "callback hell".

But still, even with tons of concurrency, you don't have parallelism. That's
all you need for 98% of your programs, but if you then need to do heavy
number-crunching it won't be enough. This is the exact same situation as in
Node.js, Python, etc., except that OCaml is massively faster than those
languages, so even some CPU-bound work remains acceptable.

Currently, there are two options if you wanna do CPU-bound work. The first:
use ctypes to call C code easily (from Lwt_preemptive) and release the lock
from within C with caml_release_runtime_system(), so your C code runs truly in
parallel (in the thread pool automatically managed by Lwt_preemptive); then
call caml_acquire_runtime_system() before returning the result back to OCaml,
to take the lock back and merge back into the normal code.

The second option is to do an old-school fork() and communicate with message-
passing. Or have a master that manages workers and communicates over ZMQ,
HTTP, TCP, IPC, etc. Or use a library that does it all for you, like Parmap,
Async Parallel, etc.

What this "multicore support" means is that you'll be able to have threads in
the same process that actually run in parallel, because the GIL is going away.
In practice it'll probably be integrated directly into Lwt, so you'll be able
to use Lwt_preemptive to run some function in a separate thread and then use
>>= to handle its result. It's gonna be simpler than both options described
above.

Again, more technical information is available in my r/ocaml post

~~~
jwatzman
> The second option is to do an oldschool fork() and communicate with message-
> passing. Or have a master that manages workers and communicates with ZMQ,
> HTTP, TCP, IPC, etc. Or use a library that does it all for you like parmap,
> Async Parallel, etc etc.

I work on the Hack language typechecker at Facebook. The typechecker is
written in OCaml, and since it needs to operate on the scale of Facebook's
codebase (tens of millions of lines of code), it's a pretty performance-
sensitive program. We needed real parallelism, but doing it with fork() and
IPC was too costly for us, both in terms of storage (if you aren't careful you
end up duplicating a bunch of data) and CPU (serializing/deserializing OCaml
data structures to send over IPC is CPU-intensive).

We ended up doing something somewhat more interesting. Before we fork(), we
mmap a MAP_ANON|MAP_SHARED region of memory -- that region will be backed by
the same physical frames in each child after we fork, so writes to it in one
child process will be visible in the others. We use a little bit of C code to
safely manage the shared-memory concurrency here.

The code for this is all open source (along with the rest of the typechecker,
HHVM runtime, etc.) if you want to take a look:
[https://github.com/facebook/hhvm/blob/master/hphp/hack/src/h...](https://github.com/facebook/hhvm/blob/master/hphp/hack/src/heap/hh_shared.c)

I also gave a tech talk a while ago on internals of the type system and
typechecker; the latter part starts here:
[https://www.youtube.com/watch?v=aN22-V-b8RM&feature=youtu.be...](https://www.youtube.com/watch?v=aN22-V-b8RM&feature=youtu.be&t=39m)

~~~
AceJohnny2
> We ended up doing something somewhat more interesting. Before we fork(), we
> mmap a MAP_ANON|MAP_SHARED region of memory -- that region will be backed by
> the same physical frames in each child after we fork, so writes to it in one
> child process will be visible in the others. We use a little bit of C code
> to safely manage the shared-memory concurrency here.

Isn't that similar to how Linux implemented threads for a long time (before
NPTL [1]) ?

I vaguely recall that for a long time people were complaining about the cost
of starting threads in Linux, because it basically amounted to fork()+shared
memory.

[1]
[http://en.wikipedia.org/wiki/Native_POSIX_Thread_Library](http://en.wikipedia.org/wiki/Native_POSIX_Thread_Library)

~~~
jwatzman
I don't know the history of threads/NPTL on Linux. However, the distinction
between "thread" and "process" in the Linux kernel is mostly a human one, not
a technical one. Take a look at the clone() syscall: spawning a thread vs.
forking a process amounts just to passing different flags to that call, to
tell it whether to share the address space or copy it, how to assign an ID to
the new thread/process, etc. (Not sure if that's how fork() and friends are
actually implemented under the hood.)

~~~
cbd1984
> (Not sure if that's how fork() and friends are actually implemented under
> the hood.)

If you use strace you can see that it is. fork(2) and pthread_create(3) both
show up as calls to clone(2).

~~~
cbd1984
Just for the edification of the masses:

pthread_create(3) looks something like this:

    
    
    clone(child_stack=0x7f79d754bff0,
          flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
          parent_tidptr=0x7f79d754c9d0,
          tls=0x7f79d754c700, child_tidptr=0x7f79d754c9d0) = 31230
    

(Newlines are mine.)

Of course, those pointers are only that size on a 64-bit architecture. The
flags are where the real point of interest is.

fork(2) is like this:

    
    
    clone(child_stack=0,
          flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
          child_tidptr=0x7f543a9aaa10) = 17978
    

When implementing fork(2), the return value from clone(2) is the child's PID
from the parent's perspective. When implementing pthread_create(3), the return
value is still an integer unique to the thread, which strace treats as if it
were a PID when it traces the system calls of individual threads into separate
files (which strace can do, because it's awesome).

Some more information:

> Linux has a unique implementation of threads. To the Linux kernel, there is
> no concept of a thread. Linux implements all threads as standard processes.
> The Linux kernel does not provide any special scheduling semantics or data
> structures to represent threads. Instead, a thread is merely a process that
> shares certain resources with other processes. Each thread has a unique
> task_struct and appears to the kernel as a normal process (which just
> happens to share resources, such as an address space, with other processes).

[http://www.makelinux.net/books/lkd2/ch03lev1sec3](http://www.makelinux.net/books/lkd2/ch03lev1sec3)

------
tangled
It's interesting to read Xavier's annual statement on why there will never be
multicore support in OCaml:
[http://mirror.ocamlcore.org/caml.inria.fr/pub/ml-
archives/ca...](http://mirror.ocamlcore.org/caml.inria.fr/pub/ml-
archives/caml-list/2002/11/64c14acb90cb14bedb2cacb73338fb15.en.html)

~~~
trentnelson

> To make things worse, non-blocking I/O is done completely differently
> under Unix and under Win32. I'm not even sure Win32 provides enough
> support for async I/O to write a real user-level scheduler.

_sigh_, VMS got the link between processes, threads, I/O and waitable events
(specifically, tying the completion of future I/O to subsequent computation)
right from day one. And by virtue of Cutler, therefore, so did NT, and thus
Windows.

UNIX did not. The core concept of separating the work (computation to be done
after an event occurs) from the worker[1] (the thread that performs the work)
is absent; the manifestation of that is the lack of good, completion-oriented
asynchronous I/O primitives. Instead of being able to say to the kernel "here,
do this, then let me know when you're done"[2] and moving on to the next piece
of work in the queue, you have to do the elaborate non-blocking multiplex
dance for socket I/O, palm file I/O off onto a separate set of threads that
can block (or do AIO) and generally manage all threading and concurrency
primitives yourself.

It took me ten years of UNIX systems programming to suddenly grasp the
elegance of the VMS/NT/Windows approach a few years ago. It provides you with
_everything_ you need to optimally exploit all your cores for work that is
both heavily compute bound _and_ I/O bound.

It has been fascinating to see the difference in performance between Linux and
Windows in practice with PyParallel when Windows kernel primitives are
exploited properly:

[https://speakerdeck.com/trent/pyparallel-
pycon-2015-language...](https://speakerdeck.com/trent/pyparallel-
pycon-2015-language-summit?slide=5).

And more recently, with 10GbE hardware at home:

Linux lwan (the top performer in the TechEmpower Framework Benchmarks):

    
    
        [trent@zebra/ttypts/1(~s/wrk)%] time ./wrk --timeout 120 --latency -c 256 -t 12 -d 30 http://10.0.0.2:8080/plaintext
        Running 30s test @ http://10.0.0.2:8080/plaintext
          12 threads and 256 connections
          Thread Stats   Avg      Stdev     Max   +/- Stdev
            Latency     5.34ms    7.46ms 197.13ms   82.40%
            Req/Sec    14.41k   364.49    18.82k    76.61%
          Latency Distribution
             50%  398.00us
             75%    9.01ms
             90%   17.50ms
             99%   28.03ms
          5178617 requests in 30.10s, 0.93GB read
        Requests/sec: 172048.49
        Transfer/sec:     31.67MB
    

Windows PyParallel:

    
    
        [trent@zebra/ttypts/1(~s/wrk)%] time ./wrk --timeout 120 --latency -c 256 -t 12 -d 30 http://10.0.0.2:8080/plaintext
        Running 30s test @ http://10.0.0.2:8080/plaintext
          12 threads and 256 connections
          Thread Stats   Avg      Stdev     Max   +/- Stdev
            Latency     1.52ms    9.38ms 492.43ms   99.33%
            Req/Sec    18.37k     1.01k   22.75k    73.50%
          Latency Distribution
             50%    1.09ms
             75%    1.28ms
             90%    1.56ms
             99%    5.18ms
          6598900 requests in 30.10s, 1.03GB read
        Requests/sec: 219236.69
        Transfer/sec:     34.92MB
        ./wrk --timeout 120 --latency -c 256 -t 12 -d 30   106.30s user 138.87s system 814% cpu 30.114 total
    
    

[1]: [https://speakerdeck.com/trent/parallelism-and-concurrency-
wi...](https://speakerdeck.com/trent/parallelism-and-concurrency-with-
python?slide=27)

[2]: [https://speakerdeck.com/trent/pyparallel-how-we-removed-
the-...](https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-
exploited-all-cores?slide=52)

~~~
gtk40
Would you mind explaining what the link between NT and VMS is?

~~~
trentnelson
The principal architect of VMS was David Cutler, purportedly the best engineer
at Digital at the time (the 80s), and the best OS designer in the industry.

Digital dropped the ball in the late 80s with regards to management of Cutler
and his team, canceling his PRISM project and leaving him and his team
disgruntled.

Elsewhere in Seattle, a chap named Bill Gates was flush with billions in cash
and knew that the shelf life of DOS was limited; if Microsoft were to succeed,
they needed a new, robust, reliable and high-performance OS that they could
"bet the company on".

Gates got word that Cutler was disgruntled at Digital, and a mutual party set
up a meeting. Cutler was dismissive of Microsoft's technology stack at the
time (DOS and some office apps) -- he was a hardcore OS engineer, and DOS was
a toy.

Gates persisted, assuring Cutler that he would have the opportunity to build
the next generation of OS from the ground up, with essentially unlimited
resources at his disposal to do it. Cutler eventually agreed, and the NT
kernel project was born.

[http://www.amazon.com/Show-Stopper-Breakneck-Generation-
Micr...](http://www.amazon.com/Show-Stopper-Breakneck-Generation-
Microsoft/dp/0029356717/ref=sr_1_12?ie=UTF8&qid=1432232814&sr=8-12&keywords=show+stopper)

[http://windowsitpro.com/windows-client/windows-nt-and-vms-
re...](http://windowsitpro.com/windows-client/windows-nt-and-vms-rest-story)

~~~
nbevans
Dave Cutler is the real stuff of legends. Obviously this is my opinion but I
admire him and his work far far far FAR more than anything Linus Torvalds has
done.

Arguably, Linus' greatest work was Git, not Linux. Linux is, architecturally,
a piece of shit! Actually, wait, so is Git. Mercurial does everything Git does
and does it far better and more elegantly. So yeah, wait... one wonders where
Linus gets all his fanatics from!

~~~
trentnelson
Heh, after reading Show Stopper and Just For Fun, I actually think Cutler and
Linus are very similar and would potentially get along in real life if it
weren't for the epic technology divide.

------
dorfsmay
I took a really hard look at OCaml a year ago, as I was running into
performance issues with Python. Lack of multicore support made me give up on
it.

Now that Rust is around and supporting multicore, that's probably where I'll
be investing my time.

I'd love to hear feedback from people who have used both Rust and OCaml.

~~~
WaxProlix
I'm in the same place sort of, and Golang and Julia seem like the obvious
higher-performance transitions from Python. What about Rust makes you consider
it so highly?

~~~
iamd3vil
Take a look at the Nim language (www.nim-lang.org), a high-performance
language with "Pythonesque" syntax.

~~~
WaxProlix
I've seen Nim going back to when it was Nimrod, but my impression (correct me
here if needed, please) of their development process isn't especially
favorable. I do like the syntax and the inherent 'threadiness' of it, for lack
of a better word.

------
nextos
I'm considering OCaml for a new project where C++ would be the typical choice.
Think algorithms handling massive amounts of data, and some numerics.

I have some experience with ML, Haskell & Lisp. OCaml is appealing because it
is quite efficient and predictable. Does it have the bit of laziness Clojure
has that makes functional programming easy with large data?

~~~
gmfawcett
Yes, there is support for laziness (see "streams"). A couple of things to keep
in mind: floating-point values in OCaml are boxed (a float is actually a
pointer to float data on the heap), and integers are one bit shorter than
native types (31 or 63 bits) due to the way OCaml values are tagged
internally. The native compiler generates good, predictable, but fairly simple
code: few optimizations are applied (although there is active work underway,
in the "flambda" project, that will significantly change this). Also, there is
of course a garbage collector, though it is quite efficient in most cases.
These factors may or may not be a performance issue in your own project.

~~~
DonPellegrino
In practice the generated code is already extremely fast and the 1-bit shorter
ints help make the GC one of the fastest I've seen. If you do a lot of
floating point calculations, you can put your floats in an array and they'll
become unboxed.

------
SniperOwl
If Jane Street Capital has its way, multicore support "will definitely go
well".

~~~
tomjen3
It sounds like such a company might also have the capital to pay people to
make it work.

It is pretty stupid for a language not to have multicore support in 2015.
Javascript has it (in its own, somewhat broken way).

~~~
rubiquity
> _Javascript has it (in its own, somewhat broken way)._

No it doesn't. Also OCaml is from a pre-multicore era. Even Erlang wasn't
multicore from the start, SMP was added in 2005.

~~~
tomjen3
Webworkers

And I don't really care if it is from another era - C has threads (as a
library, but still).

~~~
rubiquity
> _Webworkers_

JavaScript Webworkers can't interact with the page at all, they can only send
messages around. This implementation detail leads me to believe the browser is
probably doing little more than instantiating another JavaScript interpreter
and handling IPC/synchronization for you. This is a far cry from true
parallelism.

Further illustrating that JavaScript doesn't have parallelism is that Node.js
isn't parallel and in fact encourages its users to use process forking
instead.

> _And I don't really care if it is from an other era - C has threads (as a
> library but still)._

OCaml has threads. It just doesn't have parallel threads (yet). Threads
existed before CPU parallelism so of course C and a bunch of other pre-CPU
parallelism languages have them. The difference is C doesn't have a GIL
whereas OCaml does/did.

------
almosthaskeller
Looked into a static FP language recently. Was torn between OCaml and Haskell.
Leaned more toward Haskell than OCaml. Mainly because OCaml feels like it was
hacked together, with a lot of very strange and inconsistent syntax and poorly
thought out semantics. That said, I haven't chosen either yet, because Haskell
has its own share of oddities that I'm still not comfortable with. But at
least it feels more pure and consistent and well thought out in its syntax and
semantics.

~~~
alextgordon
After many years of struggling with Haskell, I've all but given up on it,
because of the broken record semantics. They seem to be able to find time to
implement monad comprehensions or whatever paper fodder is most in vogue, but
you still can't have two datatypes with the same field name in the same
module. It's not a serious language.

~~~
elihu
That is also true of OCaml, and probably of every other ML-derived language
that treats accessor functions as ordinary functions (rather than using the
C-style dot operator). There is a type-directed name resolution proposal that
would remove this limitation in Haskell, but it would probably make the
typechecker a lot more complicated.

The can't-reuse-field-names thing is annoying, but claiming that it "isn't a
serious language" because they made a design choice that doesn't meet your
exact expectations seems kind of closed-minded to me.

~~~
kuschku
Or one could use the Lisp way and just call the functions
<record-name>-<field-name>.

This avoids those problems.

~~~
creichert
This is what you generally see in larger haskell programs which might have a
name clash (at least it's what I use):

    
    
        data F = F { fName :: String }
        data P = P { pName :: String }

~~~
LeonidasXIV
Yes, this is one of the cases where ML modules are elegant, since you can put
the definition of the type into its own module, so it becomes F.t and P.t and
you avoid field name clashes.

------
feld
Seems like a major feature to put in a point release...

I don't understand why some projects have such bizarre versioning methodology.

~~~
LeonidasXIV
This is not really a point release. OCaml is versioned a little differently: a
version number has three parts, Super-Major.Major.Patch. Super-major releases
are incredibly rare; the last one was the bump to 4, which was done because
the language gained GADTs (while staying compatible with OCaml 3.x). I don't
even know what caused the bump from 2.x to 3.00. The Major part is a normal
release in which many features may be added; its format is always two digits,
of which the first may be a 0. The Patch part is just for fixes, stuff that
was broken and overlooked when the release was done.

So OCaml 4.03.0 is basically 4.3.0 in a Python-esque versioning scheme
(remember how many changes went in between Python 2.2.0 and 2.7.0?).

~~~
feld
Thanks for clearing that up

------
j_baker
Does OCaml not already support multicore? Is concurrency green thread based?
Even at that, there's nothing stopping a user from starting multiple
processes....

~~~
murbard2
Yes, the threads are green threads: only one can run at a time. There's also
the Async framework in Jane Street's Core library, and there's Lwt.

My understanding is that GC is hard with multiple threads, particularly in a
functional language, where the GC does some heavy lifting and needs to be very
performant.

~~~
Refefer
That's not really true. Erlang's VM is fantastic at GC with thousands upon
thousands of green processes multiplexed onto the system threads, allowing
soft realtime performance. Similarly, Haskell's Parallel Strategies library
works well with the Parallel GC. Immutability makes this a whole lot easier.

Or were you referring to OCaml in particular?

~~~
tormeh
Erlang uses only actors as its concurrency mechanism and exploits that fact by
giving each actor its own heap. So Erlang's GC does not need to accommodate
concurrency, even though Erlang itself does.

------
timruffles
Great! This, plus the lack of libraries, put me off. Will take a new look!

~~~
LeonidasXIV
You'll be delighted to hear that OPAM currently features >800 libraries, too.

------
tempodox
Yupeee!

