
Comparison of C++17, Go, and Java for a next-generation sequencing tool - techempower
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2903-5
======
ezoe
After a quick glance at the code, I concluded that they wrote C++ as if there
were no static types. It seems they faithfully ported the very dynamic nature
of their existing code to C++ without rethinking it.

Like what is this? [https://github.com/ExaScience/elprep-
bench/blob/master/cpp/f...](https://github.com/ExaScience/elprep-
bench/blob/master/cpp/filter_pipeline.cpp#L20-L33)

    
    
      auto alns = any_cast<shared_ptr<deque<shared_ptr<sam_alignment>>>>(data);
    

So the data is a sam_alignment inside a shared_ptr inside a deque inside
another shared_ptr inside, god forbid, an any? Why did they do that? What kind
of abomination is this? And from the context, it is the only possible type;
they use:

    
    
      try { any_cast<abomination_type>(data) ; }
      catch ( bad_any_cast ) { throw runtime_error(...) ; }
    

If you're so sure an any object only ever holds exactly one type, and
everything else is an unexpected error, you shouldn't use any at all!

They really like std::deque<T> and use it everywhere, even though sizeof(T)
is a few dozen bytes at best. A deque is a list of arrays: while it can
amortize continuous additions of elements at the front or back, the element
size here is so small that they should rather use std::vector.

Speaking of data structures, they also use std::unordered_map<int, any>.
unordered_map is very slow (it's a node-based hash map, not suitable for
modern hardware), and sizeof(int) + sizeof(any) is about 20 bytes (an int plus
two pointers), so they get no benefit from a node-based data structure here.
They should rather use a sorted vector and binary-search it.

My conclusion: it's slow because they wrote C++ like a dynamically typed
language and they chose the wrong data structures.

~~~
WesternStar
<unpopular opinion>We should break ABI on unordered map just to stop
embarrassing ourselves in public and in front of new users.</unpopular
opinion>

~~~
alexhutcheson
It’s unfortunately not just ABI, but also API. The standard specifies that you
can get iterators to specific buckets in O(1)[1], and also specifies
bucket_count(), max_bucket_count(), bucket_size() (which is specified to be
O(n)), and bucket(). Those functions and their specified performance make it
effectively impossible to implement a standards-compliant std::unordered_map
without using separate chaining.

[1]
[https://en.cppreference.com/w/cpp/container/unordered_map/be...](https://en.cppreference.com/w/cpp/container/unordered_map/begin2)

~~~
SamReidHughes
You can just remove those functions. The real problem is that the API break
would invalidate iterators and create undefined behavior.

------
Rochus
I had a quick look at the C++ source code provided at
[https://github.com/ExaScience/elprep-
bench/tree/master/cpp](https://github.com/ExaScience/elprep-
bench/tree/master/cpp).

As suspected, everything is dynamically allocated and no memory mapping (see
e.g. [http://man7.org/linux/man-
pages/man2/mmap.2.html](http://man7.org/linux/man-pages/man2/mmap.2.html)) is
used. No wonder this is slow and eats a lot of memory. At the moment I have no
information about why this design was chosen, whether there is a justification
for it, or whether this was the only option the developers knew. Maybe I can
find some hints in the paper. From what I've seen so far, it can be safely
assumed that with optimal use of data structures and system functions, the C++
results would be at least one order of magnitude better.

~~~
benhoyt
They address that somewhat in the discussion section:

> C++ provides many features for more explicit memory management than is
> possible with reference counting. For example, it provides allocators [35]
> to decouple memory management from handling of objects in containers. In
> principle, this may make it possible to use such an allocator to allocate
> temporary objects that are known to become obsolete during the deallocation
> pause described above. Such an allocator could then be freed instantly,
> removing the described pause from the runtime. However, this approach would
> require a very detailed, error-prone analysis which objects must and must
> not be managed by such an allocator, and would not translate well to other
> kinds of pipelines beyond this particular use case. Since elPrep’s focus is
> on being an open-ended software framework, this approach is therefore not
> practical.

~~~
Rochus
Of course you can use better allocators; but it's faster to avoid dynamic
allocation (e.g. by pointing to memory mapped from the input file by the OS)
altogether. If they allocate memory for each flyspeck of a 200 GB file and
also create and change a reference counter for it, nobody should be surprised
about the low performance. Have a look at what, e.g., shared_ptr does behind
the scenes.

~~~
PaulDavisThe1st
Unless you're streaming, in which case mmap'ed access on Linux is generally
slower than read/write. At least it was the last time we checked for the
Ardour project (probably about 3 years ago).

~~~
Rochus
See
[https://en.wikipedia.org/wiki/SAM_(file_format)](https://en.wikipedia.org/wiki/SAM_\(file_format\))

------
twic
> To achieve good performance, it was therefore necessary to explicitly
> control how often and when the garbage collector would run to avoid needless
> interruptions of the main program, especially during parallel phases.

Why? This is a batch program, interruptions don't matter, only the end-to-end
time does.

> The goal of elPrep is to simultaneously keep both the runtime and the memory
> use low.

Why? Keeping runtime low lets you get more work done. Keeping memory use low
means what? They are using a machine with 384 GB RAM, make use of it.

Worth noting also that they used GCC 7.2.1, Go 1.9.5, and Java 10. That's a
pretty old GCC.

They don't seem to explicitly select a GC with Java, so they'll be using G1.
G1 is still not entirely mature. It got much faster between 9 and 10, and
somewhat faster from 10 to 11. For a batch process like this, though, the
parallel collector is still probably a better choice. Using a newer JDK and a
different collector should give better performance - but admittedly, probably
won't reduce heap usage.

~~~
WatchDog
Yeah, there are lots more knobs they could have played with for Java garbage
collection.

I wonder how much it would have affected the runtime if they had capped the
Java max heap size to what the Go implementation used.

------
zmmmmm
It has to be noted: it's quite a strange approach all round - this framework
reads all the data into memory. So if you have a 100GB genome it will read
100GB into memory. Presumably it stays in memory uncompressed so we are
talking hundreds of GB to process even a single whole genome sample.

This may indeed have some performance benefits, but it's a very impractical
approach from a hardware point of view. Few places processing genomic data
will have many compute nodes with > 256GB of memory, yet that would barely
process one sample with this framework. God forbid you have a family of
samples or tumor/normal comparison samples to analyse and need several genomes
in memory together.

Genomes are for the most part massively parallelisable and nearly every other
toolkit I have seen has put that first and foremost in its design approach.
Ensuring tools process data in a streaming manner and pipe between each other
is a basic expectation of most genomic data tools.

Which is all to say ... this is a very strange beast and I'm not sure a lot of
conclusions can be drawn from it that generalise to other activities or
approaches.

~~~
yvdriess
Untrue. Large-memory nodes are par for the course for genomics workloads. Only
a few stages of the analysis pipeline can stream their input/output
effectively. Even then, writing results to disk only to read them back is
going to bottleneck your pipeline.

The authors go into more detail on this topic in:
[https://journals.plos.org/plosone/article?id=10.1371/journal...](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0209523)

------
jolmg
> Based on our benchmark results, we selected Go as our new implementation
> language for elPrep, and recommend considering Go as a good candidate for
> developing other bioinformatics tools for processing SAM/BAM data as well.

They don't seem to take into account that their results depend on their own
proficiency in each programming language.

~~~
vardump
> They don't seem to take into account that their results depend on their own
> proficiency in each programming language.

If they're going to implement it as well, then that's perfect. They now know
which language _they_ should be using with their proficiency.

In some other team, Java might have been #1. The overall best results with a
great team would almost certainly be achievable by using C++.

But this is what worked for them.

~~~
jolmg
Yes, for them that's great, but their statement sounds like they're suggesting
their results would apply for everyone.

~~~
vardump
Funny how people think what works for them works for everyone... so easy to
fall into that trap even with the best intentions.

------
voldacar
This is not surprising. Parallel GC will almost always be faster than
refcounting. If you wrote a C or C++ program to accomplish this task after
carefully planning out exactly when stuff needs to get manually
malloced/freed, you could outperform any of their approaches

And like another commenter mentioned, if you're writing a program which
streams a lot of data sequentially from disk, and where throughput is
important (such as in sequencing), you should always be using mmap

~~~
ori_b
That's a case where mmap isn't actually all that much faster than read, and
due to the inter processor interrupts needed to synchronize the memory
mappings across cores, it may end up much slower. You're grabbing large chunks
and flushing the TLB a whole lot.

If you are seeking randomly and doing small reads, then mmap will help quite a
bit: the data will be faulted in, and accessing it a second, third, or
hundredth time will not cost much.

~~~
gpderetta
Read mappings are actually not too bad; remote TLBs can be synchronised lazily
on a page fault.

------
pizlonator
Very surprising result. I wouldn’t have bet that this is what would have
happened.

But anyone working on language perf should take note even though it’s just one
result from one team and one application. Of course they probably used C++ in
a not great way and probably use Go in a better way. But maybe that is caused
by something in Go that encourages good behavior or at least encourages the
kind of behavior that Go optimizes for.

So, even if this result doesn’t mean that C++ devs should switch to Go to get
more speed, it’s a result that is worth pondering at least a bit, particularly
if you like thinking about what it is that makes languages fast or slow.

~~~
shigeo
> I wouldn’t have bet that this is what would have happened.

Which part of the results are you referring to? It's well-known that reference
counting has significantly lower throughput than tracing garbage collectors,
so the fact that C++ is outperformed here isn't surprising at all.

~~~
pizlonator
Great point. Here’s the issue: there are tons of ways of doing reference
counting in C++. Some go all-in with implied borrowing. Some make great use of
C++’s various references. Some use the reference counting only for a subset of
objects and rely on unique_ptr for hot things.

So, there is no universal answer to how C++ reference counting compares to GC.

There is a well known answer, that you’re probably referring to, if you
reference count every object and you do it soundly by conservatively having
every pointer be a smart pointer. But it just isn’t how everyone who does
reference counting in C++ does it, and I was surprised because I’m used to
folks going in the other direction: being unsound as fuck but hella fast.

~~~
kllrnohj
> there are tons of ways of doing reference counting in C++

There are but critically there are also a lot of ways to not do ref counting
_at all._ C++ isn't a refcounted language, it's a language where you _can_ use
refcounting (shared_ptr), but you don't _have_ to (unique_ptr, value types).
It's not even recommended to be primarily refcounted.

They chose a really odd subset of C++ to use here (shared_ptr exclusively),
very unorthodox and not something I've ever seen elsewhere or recommended.

------
typon
On one hand this comparison sucks because their C++ is quite non-idiomatic; on
the other hand, it reflects something about C++. Why is it so easy to write
non-idiomatic C++, so much so that it ends up much slower than a
garbage-collected language?

------
avarsheny
Guys, hire better C++ programmers rather than making stupid claims.

~~~
lostmsu
Or programmers in general. I bet they did not consider that Java consuming
more memory can be inconsequential, as in: it would free it if pressed to.

------
zwieback
The fact that GC is well suited to this application isn't surprising, but what
I thought was interesting was that subtracting the deallocation time from the
C++ benchmark brought it into line with Go. In other words, ignoring memory
management, Go and C++ performed on par for this application.

~~~
gpderetta
There is a huge amount of gratuitous reference-count updates and double
pointer indirection (see for example their string_slice). Those add up
quickly.

The rest is a lot of string manipulation. If you are not taking advantage of
being able to lay out your objects carefully and avoid memory allocations, I
wouldn't expect C++ to have any particular advantage over Go or Java in this
particular scenario.

------
DeathArrow
That only demonstrated that poor usage of the language can make C++ slow.

------
aidenn0
I thought it was well known that reference counting is slower than all but the
worst tracing GCs.

~~~
DeathArrow
True. C# and Java constantly beat Swift on speed.

~~~
etse
Sorry about the naive question, but if the memory management overhead is worse
in Swift, is the hardware it runs on typically better? I'm assuming some of
this because I've noticed Android devices tend to require more CPU/memory
compared to iOS devices in the same generation.

~~~
ken
[https://lists.swift.org/pipermail/swift-evolution/Week-of-
Mo...](https://lists.swift.org/pipermail/swift-evolution/Week-of-
Mon-20160208/009422.html)

Lattner on Swift (2016): "...while it is true that modern GC's can provide
high performance, they can only do that when they are granted _much_ more
memory than the process is actually using. Generally, unless you give the GC
3-4x more memory than is needed, you’ll get thrashing and incredibly poor
performance..."

~~~
aidenn0
Agreed. I would say 2x the heap size is table-stakes for high-performance
tracing GC, and the more you give it the better performance you can get.

------
madhadron
Here is the key point:

> elPrep allows users to specify arbitrary combinations of SAM/BAM operations
> as a single pipeline in one command line.

The assumption is that your native environment for data analysis is bash and
you have to expose anything that you might want to do in bash. Further,

> elPrep also accommodates functional steps provided by third-party tool
> writers

That is, they are attempting to provide general semantics in a bash command
line that they can handle arbitrary functions being dropped into their
program.

Remember, the average salary of a job requiring both programming and biology
knowledge is much lower than the average salary of a job requiring programming
knowledge alone, so as bioinformaticists build skill to the point where they
can be employable as programmers, they mostly leave bioinformatics.

Those who have escaped bash have usually done so into Python these days, which
creates a class system of those gluing things together in Python versus those
implementing algorithms in C or C++.

------
cppbert3
Reply from the author:

[https://github.com/ExaScience/elprep-
bench/issues/3](https://github.com/ExaScience/elprep-bench/issues/3)

------
zmower
"We have not performed a detailed comparison against the original version of
elPrep implemented in Common Lisp, but based on previous performance
benchmarks, the Go implementation seems to perform close to the Common Lisp
implementation." LMFAO.

~~~
tomp
If I understand correctly, new Go code was as fast as old Lisp code, with much
less effort and much clearer code.

 _> Most existing Common Lisp implementations use stop-the-world, sequential
garbage collectors. To achieve good performance, it was therefore necessary to
explicitly control how often and when the garbage collector would run to avoid
needless interruptions of the main program, especially during parallel phases.
As a consequence, we also had to avoid unnecessary memory allocations, and
reuse already allocated memory as far as possible, to reduce the number of
garbage collector runs. However, our more recent attempts to add more
functionality to elPrep (like optical duplicate marking, base quality score
recalibration, and so on) required allocating additional memory for these new
steps, and it became an even more complex task and a serious productivity
bottleneck to keep memory allocation and garbage collection in check._

------
teleforce
It seems that all the language implementations in this benchmarking exercise
rely on garbage collection (C++ using ref counting). It would be very
interesting if someone extended the benchmark to the D programming language,
arguably the fastest language with a GC. The fact that the elPrep tool uses a
functional software architecture [1] and that D natively supports functional
programming makes me think this type of computing is very well suited to it.

[1]
[https://github.com/exascience/elprep](https://github.com/exascience/elprep)

------
fnord123
It would be interesting to see the compiler flags used for the C++ program
that was benchmarked. The build script doesn't set an optimization level, so
it would default to -O0:

[https://github.com/ExaScience/elprep-
bench/blob/master/cpp/m...](https://github.com/ExaScience/elprep-
bench/blob/master/cpp/make.sh)

I am sure they didn't benchmark it like this but it would be interesting to
see the flags that /were/ used.

~~~
gpderetta
The jemalloc script does set the opt level, and that's the one discussed in
the paper.

------
techempower
the source code: [https://github.com/ExaScience/elprep-
bench](https://github.com/ExaScience/elprep-bench)

~~~
jolmg
That has the Java and C++ versions. Here's the Go version:

[https://github.com/exascience/elprep](https://github.com/exascience/elprep)

------
alexeiz
They use shared_ptr pervasively in their C++ implementation. It's not the best
way to manage objects in C++ by far. The only thing that this performance
comparison really tells me is that shared_ptr is worse than a modern GC. This
is not really a surprising find.

------
greendave
Probably worth noting in the title that this is a June 2019 publication, so
the versions used (gcc 7.2.1, go 1.9.5) etc. are not exceptionally ancient -
basically they used what was readily available in CentOS 7 around that time.

That said, some of the code used is quite... odd.

------
exabrial
They're using "Java 10" (no mention of which vendor or build); any idea if
they had the compressed strings option enabled?

------
pulse7
They should benchmark Java's implementation with different GCs...

------
nemetroid
I think the original title is better, although dropping "full-fledged next-
generation" would be an even better choice.

> A comparison of three programming languages for a full-fledged next-
> generation sequencing tool

~~~
Magnap
"next-generation sequencing" is a term of art in this case

~~~
nemetroid
I see, it sounded like a buzzword term (which I guess it still might be). The
point is that the current title makes it sound like a general comparison,
while the original title makes it clear that it's comparing three
implementations of a single tool.

------
google234123
Ignoring the terrible C++, why would they bother to write the same program in
three languages? I feel like spending 3x the time on just one of them would
have produced the best outcome.

~~~
teleforce
In academic publishing it is a convention to compare your proposed
implementation with at least two comparable competitors (languages,
frameworks, algorithms, schemes, etc.). I think it is naturally intriguing and
refreshing to see a performance comparison of programming languages by someone
other than the languages' authors, even though the implementations are
probably not optimized to death.

------
qzw
TLDR: 1. Go, 2. Java, 3. C++17.

Java was slightly faster than Go but used significantly more memory. C++17 was
slower than both and used more memory than Go.

I’m still reading for more details on the implementations, but the results are
certainly not what I would’ve predicted.

~~~
adrian_b
It seems that the performance was dominated by memory management, so the
comparison is not between the languages per se, but between their current
garbage collectors, and respectively the reference counting implementation for
C++.

~~~
beering
Yeah, further down in the article they track down the gap between cpp and
Go/Java to the ref-counting deallocation work. I'm not a cpp expert, but it
seems surprising to me that GC would beat ref-counting in any scenario.

~~~
aidenn0
I guess I was wrong with this comment:

[https://news.ycombinator.com/item?id=22959600](https://news.ycombinator.com/item?id=22959600)

1. Reference counting is a form of GC; you could implement a JVM that used
reference counting (though in order to be general a small amount of additional
work is needed).

2. Reference counting causes extra work every time a reference appears or
disappears. Tracing GCs amortize that cost across many allocations.

2.b. This is particularly hurtful to performance for short-lived objects,
since most tracing GCs have zero GC overhead for short-lived objects (the cost
of a nursery collection under most implementations scales with the amount of
_live_ data in the nursery, so objects that appear and disappear in the time-
span of a single nursery GC are freed at zero extra cost).

3. Malloc cannot move allocated data, so many implementations have a lot of
complexity to avoid heap fragmentation, which comes at a cost to both
allocating and freeing data. Many GC'd languages allocate small objects with a
single instruction in the typical case (just incrementing a pointer; the non-
typical case is when the nursery is full and a GC happens).

4. The JVM and Go both have a lot of effort put into their GC; the ref-
counting implementation used by this test is probably a bit more naive. In
particular, they talk about large delays when a chain of links causes many
allocations to die at the same time. A less naive refcounting implementation
would queue deleted objects and spread that work out across a larger time
period.

~~~
tomp
> you could implement a JVM that used reference counting

It would leak memory - reference counting cannot collect cycles (you need
tracing GC for that, defeating the purpose of refcounting).

~~~
aidenn0
From right after what you quoted:

> (though in order to be general a small amount of additional work is needed)

Note also that there are two methods of cycle detection for a reference-
counted GC that are not just a backup tracing GC:

1. Trial deletion (known since at least the mid-80s)

2. Various tracing systems that exploit extra information known to reference-
counted systems, e.g. Levanoni/Petrank [1], which actually implemented a
reference-counted GC for Java.

1:
[https://www.cs.technion.ac.il/~erez/Papers/refcount.pdf](https://www.cs.technion.ac.il/~erez/Papers/refcount.pdf)

~~~
tomp
Thanks, I didn't know that!

------
zerr
No AWK and Ada this time? :)

------
jayd16
Who uses GBh to measure performance? Either you can afford the RAM or not. If
you can't, you should benchmark with swap performance.

