
Neanderthal vs. ND4J – Native Performance, Java and CPU - dragandj
https://dragan.rocks/articles/18/Neanderthal-vs-ND4J-vol1
======
crockpotveggies
DL4J contributor here. I spoke with our team, and the differences in performance
are likely explained by array ordering. ND4J, for good reason, requires F
ordering because of limitations in cuBLAS. While I haven't had the opportunity
to closely examine the Neanderthal comparison (I'm also not a Clojure user),
the likely explanation is that there's an implicit ordering impacting this.

Deeplearning4j is written around F ordering and ND4J supports this.
Admittedly, our ordering API is not obvious to the average user.

Here's an example test you can run yourself that demonstrates ordering:
[https://gist.github.com/raver119/92b615704ca1bf169aa23a6a6e7...](https://gist.github.com/raver119/92b615704ca1bf169aa23a6a6e7d9880)

    
    
      o.n.i.TensorFlowImportTest - Orders: CCC; Time: 11532 ns;
      o.n.i.TensorFlowImportTest - Orders: CCF; Time: 2101 ns;
      o.n.i.TensorFlowImportTest - Orders: CFC; Time: 10202 ns;
      o.n.i.TensorFlowImportTest - Orders: CFF; Time: 1960 ns;
      o.n.i.TensorFlowImportTest - Orders: FCC; Time: 10744 ns;
      o.n.i.TensorFlowImportTest - Orders: FCF; Time: 1717 ns;
      o.n.i.TensorFlowImportTest - Orders: FFC; Time: 10097 ns;
      o.n.i.TensorFlowImportTest - Orders: FFF; Time: 1716 ns;
    

We also profiled the above test and confirmed that F -> C ordering adds
significant overhead. I can share screenshots if anyone is interested.
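To make the ordering point concrete, here is a minimal Java sketch, using plain arrays rather than ND4J's actual internals, of how C (row-major) and F (column-major) orderings lay out the same matrix, and why converting between them adds a full-copy overhead on top of the op itself:

```java
// Hypothetical sketch, not ND4J's implementation: the same 2x3 matrix in
// C (row-major) vs. F (column-major) layout, and the cost of converting.
public class OrderingSketch {

    // Flat index of element (i, j) in a rows x cols matrix.
    static int idxC(int i, int j, int rows, int cols) { return i * cols + j; } // row-major
    static int idxF(int i, int j, int rows, int cols) { return i + j * rows; } // column-major

    // Converting F -> C touches every element once: O(rows * cols) extra
    // work that a gemm wrapper pays on top of the multiplication itself.
    static double[] fToC(double[] f, int rows, int cols) {
        double[] c = new double[rows * cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                c[idxC(i, j, rows, cols)] = f[idxF(i, j, rows, cols)];
        return c;
    }

    public static void main(String[] args) {
        // The matrix [[1, 2, 3], [4, 5, 6]] stored in F (column-major) order:
        double[] f = {1, 4, 2, 5, 3, 6};
        // In C (row-major) order the same matrix reads row by row:
        System.out.println(java.util.Arrays.toString(fToC(f, 2, 3)));
        // prints [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    }
}
```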

~~~
dragandj
While this is a good explanation, keep in mind that:

1\. This is the benchmark that you provided, so while this might not be
obvious to the average user, it also seems not to have been obvious to the
above-average user who wrote the benchmark. I was just using what you proposed,
assuming that you used the right thing in your library.

2\. It still does not explain how you got better performance with ND4J with
the same non-optimal call, which was what started the discussion, and inspired
this post.

3\. Neanderthal supports both row- and column-oriented ordering with cuBLAS at
the same performance, and won't have the problems you mention for ND4J.
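I don't claim this is Neanderthal's actual implementation, but the standard trick for serving both orderings at full speed with a single column-major BLAS is the identity C = A·B ⟺ Cᵀ = Bᵀ·Aᵀ: a row-major buffer, viewed column-major, is exactly the transpose of the same matrix, so a row-major gemm can call the column-major one with swapped operands and no conversion copy. A self-contained Java sketch:

```java
// Hypothetical sketch (plain arrays, not Neanderthal or ND4J code):
// serving row-major inputs with a column-major gemm, copy-free.
public class GemmTrick {

    // Naive column-major gemm: C(m x n) = A(m x k) * B(k x n),
    // all buffers in F (column-major) order.
    static void gemmColMajor(int m, int n, int k, double[] a, double[] b, double[] c) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                double s = 0;
                for (int p = 0; p < k; p++)
                    s += a[i + p * m] * b[p + j * k];
                c[i + j * m] = s;
            }
    }

    // Row-major gemm via C = A*B  <=>  C^T = B^T * A^T: a row-major buffer
    // *is* its transpose viewed column-major, so we just swap the operands
    // and dimensions -- no copying, no explicit transposing.
    static void gemmRowMajor(int m, int n, int k, double[] a, double[] b, double[] c) {
        gemmColMajor(n, m, k, b, a, c);
    }

    public static void main(String[] args) {
        // A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], both row-major.
        double[] a = {1, 2, 3, 4}, b = {5, 6, 7, 8}, c = new double[4];
        gemmRowMajor(2, 2, 2, a, b, c);
        // A*B = [[19, 22], [43, 50]] in row-major order:
        System.out.println(java.util.Arrays.toString(c));
        // prints [19.0, 22.0, 43.0, 50.0]
    }
}
```

The same swap works in the other direction, which is why a library built this way pays no ordering penalty either way.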

I'm, of course, interested in following up on this. Please decide which cases
you'd like to compare, post the (optimal) code along with the ND4J and
Neanderthal numbers you get, and I'll respond with my comments.

~~~
treo
It looks like, while converting from my benchmarking code, you dropped the
'f' when creating the result array.

[https://github.com/treo/benchmarking_nd4j/blob/master/src/ma...](https://github.com/treo/benchmarking_nd4j/blob/master/src/main/java/com/example/neanderthal/NeanderthalComparision_1024x1024.java#L17)

The difference is rather huge with the newer versions of nd4j.

While the numbers in the following gists do not contain the measurements I
took for neanderthal, they do contain the numbers that I got for ND4J.

Without f ordering:
[https://gist.github.com/treo/1fab39f213da26255cf4f75e383ff90...](https://gist.github.com/treo/1fab39f213da26255cf4f75e383ff908)

With f ordering:
[https://gist.github.com/treo/94fe92c9417b5c8b24baa12924a35b0...](https://gist.github.com/treo/94fe92c9417b5c8b24baa12924a35b04)

As you can see, something happened between the 0.4 release (I took that as the
comparison point, since that was when I last ran my own benchmarks) and the
0.9.1 release that introduced additional overhead.

Originally I planned to create my own write-up on this, but I wanted to first
find out what happened there.

Given that ND4J is mainly used inside of DL4J, and that the matrix sizes it is
used with are usually rather large, the performance overhead I observed for
tiny multiplications isn't necessarily that bad, as the newer version performs
much better on larger matrices.

~~~
dragandj
You're right. In that particular case, ND4J matches Neanderthal's speed. But
only in that particular case; and even then ND4J is still not faster than
Neanderthal. My initial quest was to find out whether ND4J can be faster than
Neanderthal, and I still couldn't find a case where it is.

Although, in my defense, the option in question here is very poorly
documented. I've found the ND4J tutorial page where it's mentioned, and even
after re-reading the sentence multiple times, I still cannot connect its
description to what it (seems to) actually do. It also does not mention that
it affects computation speed.

Anyway, I'm looking forward to reading your detailed analysis, and especially
seeing your Neanderthal numbers.

~~~
agibsonccc
Fair point, and we are fixing it now:
[https://github.com/deeplearning4j/deeplearning4j-docs/issues...](https://github.com/deeplearning4j/deeplearning4j-docs/issues/83)

We will be sending out a doc for this by next week with these updates. Thanks
a lot for playing ball here.

Beyond that, can you clarify what you mean? Do you mean just the gemm op?

For that, that's the only case that mattered for us. We will be documenting
the what/how/why of this in our docs.

Beyond that, I'm not convinced the libraries are directly comparable when it
comes to their sheer scope.

You're treating ND4J as a gemm library rather than a fully fledged
numpy/tensorflow-style library with hundreds of ops and support for things you
would likely have no interest in building.

A big reason I built nd4j was to solve the general use case of building a
tensor library for deep learning, not just a gemm library.

Beyond that - I'll give you props for what you built. There's always lessons
to learn when comparing libraries and making sure the numbers match.

Our target isn't you, though; it's the likes of Google, Facebook, and co., and
the scope of tasks they are tackling.

That being said - could we spend some time on docs? Heck yeah we should. At
most, we have Javadoc and examples. We tend to help people as much as we can
when profiling.

Could we manage it better? Yes, for sure. That's partially why we moved DL4J to
the Eclipse Foundation: to get more third-party contributions and build a
better governance setup. Will it take time for all of this to evolve? Oh yeah,
most definitely.

No project is perfect, and there are always things to improve on.

Anyways - let's be clear here. You're a one-man shop who built an amazingly
fast library that scratches your own itch for a very specific set of use
cases. We're a company and community tackling a wider breadth of tasks and
trying to focus more on serving customers and adding odd things like different
kinds of serialization, Spark interop, etc.

We benefit from doing these comparisons, and it forces us to better document
things we normally don't pay attention to. This little exercise is good
for us. As mentioned, we will document the limitations a bit better, and we
will make sure to cover other topics like allocation, as well as the BLAS
interface.

Positive change has come out of this, and I'd like to thank you for the work
you put in. We will make sure to re-run some of the comparisons on our side.

~~~
dragandj
Sure. I agree. You as a company have to look at your bottom line above all.
Nothing wrong with that.

Please also note that Neanderthal has hundreds of operations. The set of
use cases where it scratches itches might be wider and more general than you
think.

The reasons I'm showcasing matrix multiplications are:

1\. That's what you used in the comparison.

2\. It is a good proxy for the overall performance. If matrix multiplication
is poor, other operations tend to be even poorer :)

Anyway, as I said, I'll be glad to compare other operations that ND4J excels
at, or that anyone thinks are important.

I would also like to see ND4J's comparisons with TensorFlow, NumPy, PyTorch,
or the JVM-based MXNet.

~~~
agibsonccc
Yeah, we definitely need to spend some more time on benchmarks after all is
said and done.

That being said, while gemm is one op, there's a lot more to this than the JNI
back-and-forth used to call other libraries. What also matters here are things
like convolutions, pairwise distance calculations, element-wise ops, etc.

There's nuance there.

There are multiple layers here to consider:

1\. The JNI interop managed via JavaCPP (relevant to this discussion)

2\. Every op has allocation vs. in-place trade-offs to consider

3\. Our Python interface, which is yet another layer to benchmark (we use
pyjnius for Jumpy, the Python interface for ND4J)

4\. The op implementations for the CUDA kernels and the custom CPU ops we
wrote (that's where our AVX-512 and AVX2 jars matter, for example)
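As an illustration of point 2, here is a hypothetical Java sketch (plain arrays, not ND4J's API) of the allocation vs. in-place trade-off:

```java
// Hypothetical sketch of the allocation vs. in-place trade-off: an
// out-of-place op pays an allocation per call, while an in-place op
// reuses a caller-supplied buffer -- which is what you want in a hot
// loop that runs the same op millions of times.
public class AllocVsInPlace {

    // Out-of-place: convenient, but allocates a fresh result every call.
    static double[] scale(double[] x, double alpha) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = alpha * x[i];
        return y;
    }

    // In-place: writes into a preallocated buffer, no garbage per call.
    static void scaleInto(double[] x, double alpha, double[] out) {
        for (int i = 0; i < x.length; i++) out[i] = alpha * x[i];
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] out = new double[x.length]; // allocated once, reused below
        for (int step = 0; step < 1000; step++)
            scaleInto(x, 2.0, out); // hot loop: zero allocations per step
        System.out.println(java.util.Arrays.toString(out));
        // prints [2.0, 4.0, 6.0]
    }
}
```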

For the subset we are comparing against, it's basically a matter of making
sure we wrap the BLAS calls properly. That's definitely something we should be
doing.

We've profiled that and chose the pattern you're seeing above with F ordering.

That is what we chose to optimize for, and where we are fast. You are faster
in the other cases and have laid that out very well.

Again, there's still a lot that was learned here and I will post the doc when
we get it out there to make that less painful next time.

You made a great post here and really laid out the trade offs.

I wish we had more time to run benchmarks beyond timing our own use cases; if
we had a smaller scope, we would definitely focus on every case you're
mentioning here. We will likely revisit this at some point if we find it worth
it.

In general, our communications and docs can always be improved (especially
around internals like our memory allocation).

Re: your last point, we do do this kind of benchmarking with TensorFlow. For
example: [https://www.slideshare.net/agibsonccc/deploying-signature-
ve...](https://www.slideshare.net/agibsonccc/deploying-signature-verification-
with-deep-learning) (see slide 3, and the broader deck, for an idea of how we
profile deep learning for apps using the JVM)

We need to do a better job of maintaining these things, though. We don't keep
them up to date and don't profile as much as we should. It has diminishing
returns after a certain point versus building other features.

I'm hoping a CI build to generate these numbers is something we get done this
year, so we can both prevent performance regressions and have consistent
numbers to publish in the docs.

Once the Python interface is done, that will be easier to do and justify,
since most of our "competition" is in Python.

------
vonnik
For anyone interested in looking at or running ND4J benchmarks, these links
may be useful.

[https://github.com/treo/benchmarking_nd4j/tree/master/src/ma...](https://github.com/treo/benchmarking_nd4j/tree/master/src/main/java/com/example/neanderthal)

[http://deeplearning4j.org/native](http://deeplearning4j.org/native)
[http://deeplearning4j.org/workspaces](http://deeplearning4j.org/workspaces)

We have our own garbage collection as well as native config and off heap
memory management.

We use this

[https://github.com/deeplearning4j/deeplearning4j/blob/master...](https://github.com/deeplearning4j/deeplearning4j/blob/master/nd4j/nd4j-backends/nd4j-backend-impls/nd4j-native-platform/pom.xml#L22)

to specify use of MKL.

------
dragandj
Author here. I'm open to discussion.

The source is in the text, but conveniently accessible here:
[https://github.com/uncomplicate/neanderthal/blob/master/exam...](https://github.com/uncomplicate/neanderthal/blob/master/examples/benchmarks/src/benchmarks/neanderthal_vs_nd4j.clj)

~~~
thom
Huge thanks for Neanderthal and Bayadera! There is stuff that is basically
impossible to run in Stan that becomes feasible with Bayadera on a laptop with
a GPU, and the fact that I can do all this in Clojure, without having to
migrate to Python or, worse, R, has been a great boon.

~~~
dragandj
Thanks for the thumbs up!

------
shoyer
> I am almost sure that both would be faster than Numpy; that would be a good
> comparison.

This seems a little unfair :). I’m pretty sure that if you’re using NumPy
linked against MKL, it would be exactly as fast as Neanderthal running on MKL.

Matrix multiplication benchmarks themselves just aren’t that interesting when
all they are doing is testing an underlying library.

