
Nodejs+Fabric as fast as multi-threaded C++ - FabricPaul
http://fabric-engine.com/2011/11/server-performance-benchmarks/
======
allertonm
Not that it diminishes the interest of this post (quite the opposite),
but it's worth noting that this is not achieved using plain-vanilla
JavaScript. From the product page:

"The high-performance parts of the application are written using a
performance-specific extension to JavaScript, called KL (Kernel Language).
This language is similar in scope and syntax to JavaScript, but has some key
differences that optimize it for writing high-performance code.

Fabric applications are described as a dependency graph (think of a flow
chart). The dependency graph describes data, and the transformations that must
happen to that data. Fabric analyses the graph to discover where it is
possible to perform tasks in parallel. (task-based decomposition) Fabric
analyses the graph to discover where it is possible to perform the same
instruction on lots of data at the same time.

All of this is possible because Fabric has the LLVM compiler embedded within
it. This means that applications are dynamically compiled on target. The
Fabric plug-in has to be written for each platform, but Fabric applications
only have to be written once.

Fabric handles CPU multi-threading automatically, but the developer must
explicitly write code for the GPU using OpenCL."

~~~
itsnotvalid
This is something plain JavaScript can never solve: its objects need runtime
machinery behind them to work, so it will never be as fast as C code.

The other blog post that appeared on HN this week sums it up [1].

For those who didn't read that post: basically, if you want the performance of
C, you have to make your data static and inflexible, i.e. use types and avoid
indirection. Looking at the KL used in the benchmark [2], you may find that it
resembles C code. That's exactly what they did to make it fast: types, and
freedom from indirection.

[1]: [http://blog.mrale.ph/post/12396216081/the-trap-of-the-
perfor...](http://blog.mrale.ph/post/12396216081/the-trap-of-the-performance-
sweet-spot)

[2]: [https://github.com/fabric-
engine/Benchmarks/blob/master/Serv...](https://github.com/fabric-
engine/Benchmarks/blob/master/Server/ValueAtRisk/sort.kl)

~~~
fleitz
That's incorrect. Objects aren't really "objects": in imperative languages
they're largely syntactic sugar for passing a struct as a this pointer and
providing a vtable. They don't actually implement many of the ideas from OO
theory, such as sending messages to objects.

Look at a language like OCaml, which can sometimes beat C, or at C++, which
also frequently beats C.

Most of the reason C is fast is that it mirrors the hardware so closely,
allowing people to write reasonably optimized portable assembler. C falls down
on macro optimizations, such as inlining function pointers, that are available
to higher-level languages such as OCaml. Sometimes the macro and runtime
optimizations are far more important than the micro optimizations.

You don't need static typing or a lack of object support to write fast code.
OCaml beats C while providing both inferred typing and object support.

~~~
sausagefeet
Have a link to benchmarks where Ocaml beats C? In my experience Ocaml is
usually only about 2x slower than C, but never beating it. Also, Ocaml is
statically typed. Type inference is just inferring the static types.

~~~
fleitz
[http://shootout.alioth.debian.org/u32/benchmark.php?test=all...](http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=gcc&lang2=ocaml)

[http://flyingfrogblog.blogspot.com/2009/07/ocaml-vs-f-
burrow...](http://flyingfrogblog.blogspot.com/2009/07/ocaml-vs-f-burrows-
wheeler.html)

You're right about OCaml generally being 1.5X slower than C, but it does beat
it for some problems, which is impressive given all the additional features it
provides. I must have been looking at the numbers wrong last time I checked.

Type inference is very different from static typing because it allows
functions to be specialized at run/compile time, which can result in having to
write much less code.

For example, in C a map function would have to be written for every datatype
(or lose the benefits of static typing by using a void*), whereas in OCaml/F#
you get type safety and specialized/inlined code for free.

~~~
sausagefeet
> Type inference is much different than static typing because it allows
> functions to be specialized at run/compile time which can result in having
> to write much less code.

You are confused: type inference is purely about determining what type
something is. It can determine that a function is polymorphic, but this has
nothing to do with how the polymorphism is implemented. AFAIK Ocaml (by that I
mean the INRIA Ocaml implementation) doesn't specialize polymorphic functions
at all. Values are boxed and the boxes are all the same size, so a single
function definition is all that is needed for a polymorphic function or type.
.NET does do these things but, again, that has nothing to do with type
inference. There is no difference between me specifying that a function has
type 'a -> 'b and the compiler inferring that.

~~~
Locke1689
This is correct. The most common algorithm used for type inference is Hindley-
Milner (<http://en.wikipedia.org/wiki/Hindley%E2%80%93Milner>).

------
VMG
_The C++ versions were compiled using gcc version 4.4.5 using the compiler
flags “-O6 – lpthread”._

Isn't _-O6_ the same as _-O3_ ?

~~~
pjin
Fortran compilers (gcc has a Fortran 95 frontend) have long offered
optimization levels beyond -O3; i.e., -O6 is a decent upper bound among
popular free and commercial compilers, not just gcc/gfortran. The author
probably came from a HPC background where -O6 is commonly used for
compatibility among such a variety of compilers, even when writing C and C++
programs. Technically not the most portable way to build, but it works.

~~~
FabricPaul
I asked the engineer for a response: "For reference, I used -O6 because it's a
historical convention (more of a joke, really) for "optimize the crap out of
it". UNIX geeks have been using -O6 in this way for about 30 years."

------
dubajj
Defeating poorly optimized C++ code is far from impressive. I would also
recommend NOT running any performance tests on EC2, since they throttle CPU
unpredictably.

------
blub
I wouldn't call that code C++ at all, it's C with a few C++ keywords/niceties.
[1],[2]

[1] [https://github.com/fabric-
engine/Benchmarks/blob/f11cf6cc8cf...](https://github.com/fabric-
engine/Benchmarks/blob/f11cf6cc8cf4b3cab8048bf48e14dc651532be98/Server/ValueAtRisk/var-
st.cpp) [2] [https://github.com/fabric-
engine/Benchmarks/blob/f11cf6cc8cf...](https://github.com/fabric-
engine/Benchmarks/blob/f11cf6cc8cf4b3cab8048bf48e14dc651532be98/Server/ValueAtRisk/var-
mt.cpp)

~~~
Game_Ender
Yep, and here is the C version: <http://pastebin.com/z7apLBHb> and its diff:
<http://pastebin.com/YHjW6AHp>

On my machine the C and C++ versions essentially have the same speed.

------
FabricPaul
I've had a few email/tweet exchanges regarding KL, so I thought it would be
useful to clarify a few things:

\- KL is a language with a syntax that is very close to JavaScript. It borrows
syntax from JavaScript, but not the rules of the language itself - just as
JavaScript borrows syntax from C, and OpenCL borrows from C.

\- There are many things in JavaScript we don't support. Some things we don't
currently support (e.g. in-line initialization of arrays) but will probably
support in the future; other things we probably won't support (e.g. regular
expressions as language objects); and finally there are things we will never
support (i.e. closures).

\- There are nice features of KL that JavaScript does not have, for instance
arithmetic operator overloading. These features are included because they are
particularly useful for computational problems.

\- KL is not JavaScript++ - it's designed for writing high-performance
operator code, not to handle everything that JavaScript can do. We don't want
to reinvent the wheel :)

We will work on a post to cover KL in more detail, including roadmap. You can
email info at fabric-engine dot com if you have any questions you want to take
offline.

Thanks, Paul

------
Srirangan
Extremely interesting. Since we're benchmarking it against C++, is it safe to
assume it performs much better than the JVM?

~~~
spullara
Yes, I ported it and though the JVM performed well, it is beaten by the
optimizing C++ compiler.

[https://github.com/spullara/Benchmarks/blob/master/Server/Va...](https://github.com/spullara/Benchmarks/blob/master/Server/ValueAtRisk/VarMT.java)

------
apaprocki
While this is interesting, not all CPUs are created equal. I can run the
var-mt.cpp example on POWER hardware in:

    real    0m25.680s

My point being that node.js doesn't work on this box, nor does Fabric. So when
developing specialized algorithms, you sometimes might want to run them in
specialized environments.

~~~
FabricPaul
Right - we would have to build a version of the engine to run on that
hardware. Once that's done, though, the applications would run. E.g. we
prototyped on ARM earlier this year, and our unit tests worked.

That said - if you're building for specialized environments, you're probably
going to want to hand-optimize code rather than rely on LLVM to do it. LLVM
does a pretty good job though :)

------
ravloony
So this software involves installing a plugin in the browser? In an age where
users are being warned by the browser makers themselves to be careful about
that sort of thing? I can't see this ending well.

~~~
FabricPaul
Hi there - it depends on who you're targeting. Our client-side focus is on
native developers looking for high performance in web applications, e.g.
medical visualization <http://vimeo.com/31970502> \- the benefits of web
applications are great, but performance is the current limiting factor.

We are likely to build support for NaCl in the future.

On the server-side, there is no requirement for a client to install a plug-in
- it's for high performance on the server, i.e. semantic analysis,
compute-bound problems, etc.

Last point - we also think a lot about hybrid models where performance can be
accessed on both client and server. This is much longer term, but the design
of Fabric allows for it.

~~~
FabricPaul
p.s. you can see our sample client-side apps here:
<http://vimeo.com/groups/fabric/videos> and play with them at
<http://demos.fabric-engine.com>

You can see how we've extended Fabric to include existing C++ libraries -
great for custom data types, streaming data, etc.

It is not really a consumer web technology - we're acutely aware of the plug-
in friction/antipathy. However - if you want HPC in the browser, this is one
of the only ways to do it.

------
spullara
Porting the C++ code directly to Java:

    macpro:ValueAtRisk sam$ time java -cp . VarMT
    VaR = -43.7173372179254300
    
    real    0m36.863s
    user    4m48.944s
    sys     0m0.978s
    
    macpro:ValueAtRisk sam$ time ./var-mt
    VaR = -43.7173372179254329
    
    real    0m27.048s
    user    3m33.971s
    sys     0m0.145s

Pretty good showing for such a low level benchmark.

[https://github.com/spullara/Benchmarks/blob/master/Server/Va...](https://github.com/spullara/Benchmarks/blob/master/Server/ValueAtRisk/VarMT.java)

It would be moderately interesting to see how well this benchmark would do in
OpenCL or the like.

(Benchmarks ran on a 3.33 GHz, 6-core MacPro, JDK 7 Developer Preview)

~~~
FabricPaul
cool :) Thanks for doing that - we can merge it in at a later date and include
Java results.

We expose OpenCL in Fabric as an extension, as we wanted to be able to target
heterogeneous hardware architectures. We didn't use this for benchmarking as
we wanted to show CPU performance first. For clarity - KL does not compile
down to OpenCL, you have to write for the GPU explicitly.

------
berkut
Could probably squeeze a bit more out of the C++ version by targeting the
specific architecture of the CPU to make use of SSE.

Also, what floating point type is KL using - float or double? And is double
necessary? Converting the C++ code to use floats would probably provide a fair
speedup on the divides, and from squeezing more data into cache lines...

~~~
gcp
_Could probably squeeze a bit more out of the C++ version by targeting the
specific architecture of the CPU to make use of SSE._

Not only that: from browsing the code, the critical loop is likely matrix
multiplication. If that's the case, any kind of engine that is smart about
SSE, cache lines, etc. is going to be able to outperform simple C/C++ code.

Of course, there are excellent matrix maths libraries for C/C++ that could be
used instead.

~~~
to3m
More than half of the running time seems to be taken up by the generation of
normally-distributed random numbers. Sort of makes sense, I suppose, since
that bit has a loop and a `sqrt' and a `log' in it.

The repeated calls to `exp' seem to take up some time too.

As for the matrix multiplication, that only happens on startup, so it's surely
irrelevant. The bit that runs a lot just does matrix*vector. It is rather hard
to make that cache-incoherent, as it just walks forwards through all inputs
and outputs. In any event I would think that the program's entire working set
will fit in L1.

I was merely fiddling with this out of interest, so I didn't spend ages
SSE2ifying it. The VC++ x64 compiler doesn't do inline assembly language
anyway. But if you halve the number of multiply-adds `multMatVec' does, under
the assumption that this would make it twice as fast, and that twice as fast
would be what an SSE2 implementation would be like, it makes no noticeable
difference.

(I was fiddling with the single-threaded version, using Visual Studio 2010,
compiling for x64.)

------
tantalor
Is Fabric Engine commercial software? If so, the authors of this post ought to
disclose their interests as the owners.

~~~
FabricPaul
Hi there - apologies. Fabric is a commercial company. Fabric Engine will be
free for non-commercial use - we're currently in beta so pricing is not yet
finalized. I assumed the username 'FabricPaul' was a good indication, but
thanks for calling it out.

~~~
foobarbazetc
"free for non-commercial use" == commercial.

------
agoder
In a quick test I did, using a better compiler (gcc 4.6.2 or icc) makes it 15%
faster.

~~~
FabricPaul
Cool - can you contribute the code to the repository so we can test and merge
it in? Thanks

------
nphase
Fabric Server isn't available to play with?

~~~
FabricPaul
Hi there - not yet. We are going to start the alpha release very soon, though.
You can play with the client-side stuff in the meantime - it's essentially the
same system. If you sign up for the newsletter, you will get a notification
when we release the FE Server stuff.

