
Writing Performance Sensitive OCaml Code - sndean
https://janestreet.github.io/ocaml-perf-notes.html
======
gsg
This is about a decade old, I think. With a substantial optimisation suite
added to ocamlopt since, some of it will be out of date. The advice on
inlining in particular is obsolete.
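
For example, with a flambda compiler you can now control inlining directly with attributes rather than relying on the old heuristics. A rough sketch (not from the article):

    (* Ask the compiler to always inline this definition. *)
    let[@inline always] add1 x = x + 1

    (* Request inlining at a particular call site. *)
    let g y = (add1 [@inlined]) y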

~~~
harrisi
Hm, could you say why you think that? Looking at the GitHub history it's hard
to tell. The earliest history I can find for this page is from 2013[0], but
that commit is pretty clearly a migration of the page rather than its original
writing. Still, it's the only thing I can find for dating the document.

[0]: https://github.com/janestreet/janestreet.github.com/commit/77eea6fbcb122f98a5726c6cb88b5c5922529a6a

~~~
rbehrends
What gsg is referring to is the flambda optimization framework [1], which has
been available for a while.

If you're using OPAM, install a flambda-enabled compiler with (e.g.)

    opam switch install 4.04.1+flambda

[1] https://caml.inria.fr/pub/docs/manual-ocaml/flambda.html

------
pkhagah
> The OCaml garbage collector is a modern hybrid generational/incremental
> collector which outperforms hand-allocation in most cases. Unlike the Java
> GC, which gives GCs a bad name, the OCaml GC doesn't allocate huge amounts
> of memory at start-up, nor does it appear to have arbitrary fixed limits
> that need to be overridden by hand.

From the linked article. Can anyone confirm or deny this? I have a hard time
believing this statement.

~~~
jblow
It is simply not true unless you have a pathological idea of "hand-allocation"
(which, to be fair, is how some programmers do program).

Let me put it this way ... all "garbage collection is fast" claims are saying
the following thing:

"It is faster for the programmer to destroy information about his program's
memory use (by not putting that information into the program), and to have the
runtime system dynamically rediscover that information via a constantly-
running global search and then use what it gleans to somehow be fast, than it
is for the programmer to just exploit the information that he already knows."

It sure sounds like nonsense to me.

~~~
vvanders
Yup, an arena allocator that you drop at the end of an operation will always
be faster than a GC.

~~~
rbehrends
For a modern generational GC, the minor heap works just like an arena
allocator (at least for the data that can be thrown away), so this shouldn't
result in any performance difference (assuming the minor heap is large
enough).
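
(If the default minor heap turns out to be too small for a given workload, it
can be resized through the Gc module. A minimal sketch, with an arbitrary
example size:)

    (* Enlarge the minor heap so short-lived values die before a minor
       collection has to copy survivors. The size is in words: 4_194_304
       words is about 32 MB on a 64-bit machine. *)
    let () =
      Gc.set { (Gc.get ()) with Gc.minor_heap_size = 4_194_304 }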

~~~
_yosefk
Isn't it harder to deallocate the data if you actually check that it's unused,
as a GC must, instead of just assuming it?

~~~
barrkel
A GC doesn't check whether memory is unused; it looks for memory that's used
and frees whatever is left over. Having a small ratio of used memory to total
memory is what makes GC cheap (in compute time). It also means that allocating
short-lived memory with abandon is cheap, if that's most of your allocation.
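
You can see this in OCaml with the Gc counters: allocate a pile of short-lived
values and almost nothing gets promoted out of the minor heap. A rough sketch
(the exact numbers will vary):

    (* Allocate many short-lived refs; compare words allocated in the
       minor heap with words promoted to the major heap. *)
    let () =
      let before = Gc.quick_stat () in
      for _ = 1 to 1_000_000 do
        ignore (Sys.opaque_identity (ref 0))
      done;
      let after = Gc.quick_stat () in
      Printf.printf "minor words: %.0f  promoted words: %.0f\n"
        (after.Gc.minor_words -. before.Gc.minor_words)
        (after.Gc.promoted_words -. before.Gc.promoted_words)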

~~~
_yosefk
Got it - an interesting point, and it follows that a use case can be engineered
where a GC adds essentially zero overhead over an arena allocator.

But in the general case where some objects are short-lived and others aren't,
surely manually splitting your allocations to malloc for long-lived ones and
some sort of arena_alloc for short-lived ones ought to be faster than
allocating them all in one place, then copying the ones which are still
reachable out of the area reserved for short-lived objects?

(This is not to say a GC-based system will be slower "on average", because
nobody knows what the "average" is. A realistic arena-based system can have
objects most of which are short-lived, but some of which need to live longer,
and you only find that out long after they're allocated; in that case one has
to manually reallocate those objects just as a GC would. Doing that, say, the
C++ way is definitely more bug-prone than a GC's bug-free handling of it, and
one way to make it less bug-prone in C++ is to make deeper copies rather than
trying to minimize copying - and now you might easily be slower than a GC. I'm
just saying it's very easy to find a case where a system that gets no hints
from the programmer about object lifetimes, and instead discovers them fully
automatically, is slower than a system which does get those hints. And of
course a GC _can_ provide ways to supply such hints; I'm just not aware of one
which does - perhaps it's avoided on the theory that the GC algorithm might
change and you don't want hints which operate in terms not portable between
algorithms to become part of your interface.)

~~~
rbehrends
> But in the general case where some objects are short-lived and others
> aren't, surely manually splitting your allocations to malloc for long-lived
> ones and some sort of arena_alloc for short-lived ones ought to be faster
> than allocating them all in one place, then copying the ones which are still
> reachable out of the area reserved for short-lived objects?

This is where things get difficult, actually, and you need benchmarks because:

1. You still have a bump allocator that's much faster than a general purpose
first-fit/best-fit allocator and has generally better memory locality than a
pool allocator.

2. Offsetting that may be the additional tracing and/or copying you are doing
as a result of garbage collection.

3. Manual memory management techniques often have their own performance
costs: std::shared_ptr and std::unique_ptr both add overhead and you sometimes
see additional copying where lifetimes are difficult to predict.

Which one has the higher cost is often something that can only be determined
by testing actual code (and it can go either way).

I'll also note that this is primarily of interest for functional languages,
which often have a high allocation rate. For imperative programs, the memory
allocation rate (and ratio of memory that contains pointers to memory that
doesn't) is often so low that even a very basic mark/sweep collector would be
fine, as long as pause times don't matter in your application domain. This
means that it's often not really a practical issue, one way or the other.

------
mcguire
" _In the file called test_asm.ml the function f_

    
    
        let f a = a+1
    

" _The generated asm looks like the following:_ "

    
    
          camlTest_asm__f_33700:
          .L119:
              addq    $2, %rax
              ret
    

You sure about that? $2?

~~~
brilee
From the article:

> Unlike floats, ints are represented as immediates, i.e., they are not
> allocated on the heap. To tag ints OCaml steals the LSB bit of an int and
> sets it to 1 always. This means that we get 63 bits of precision in OCaml
> ints and any int value n will appear in the assembly as 2*n+1 (the LSB is
> always 1 and the bits of the number shifted one place in).
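
So for the f above: if a is stored as 2*a+1, then a+1 has to be stored as
2*(a+1)+1 = (2*a+1)+2, which is exactly why the compiler emits addq $2. A toy
illustration of the arithmetic (the tagged function here is hypothetical, just
mirroring what the runtime does):

    (* How an OCaml int n is encoded in a machine word. *)
    let tagged n = 2 * n + 1

    let () =
      let a = 5 in
      (* adding 1 to the value adds 2 to its encoding *)
      assert (tagged (a + 1) = tagged a + 2)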

------
gcc_programmer
In the first code snippet, they have addq $2, %rax, but the function actually
adds 1 to its argument.

~~~
thelema314
This is correct; OCaml integers are tagged by setting their LSB to 1 to
indicate they're not a pointer. This means that the integer n is stored as
2n+1, so adding 1 to n requires adding 2 to the stored value.

------
jackmott
Can anyone explain, or point to an explanation of, why OCaml boxes floats?

~~~
thedufer
The section "OCaml Blocks and Values" at
[https://realworldocaml.org/v1/en/html/memory-
representation-...](https://realworldocaml.org/v1/en/html/memory-
representation-of-values.html) is a good explanation.
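
If you want to poke at this from inside the language, the (unsafe,
introspection-only) Obj module shows the difference - a quick sketch:

    (* An int is an immediate; a float is a heap block tagged Double_tag. *)
    let () =
      assert (Obj.is_int (Obj.repr 42));
      assert (Obj.is_block (Obj.repr 42.0));
      assert (Obj.tag (Obj.repr 42.0) = Obj.double_tag)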

~~~
haimez
Let me summarize: academic language, it's easy for the GC that way, and "if
you thought that float boxing would kill performance you wouldn't have started
down this path".

~~~
ernst_klim
Afaik flabmbda can do some float unboxing, and float unboxing is not
impossible until you use generic value. Nope, OCaml is not "academic" but a
very pragmatic language.

------
chimtim
re-write it in C++

