More

stuntprogrammer · on July 17, 2017

Not kdb+, but a proprietary (internal-only) language that it heavily influenced was designed around execution on GPU clusters.

The FPGAs were used mainly for feedhandlers and there was a different DSL for that (compiling to verilog).

It was indeed rather something to see :)

throwaway7645 · on July 17, 2017

Was it received favorably? A secret sauce, or a failed project.

stuntprogrammer · on July 17, 2017

Secret sauce, at least for the team involved, and well-received. Nice combination of extreme perf and decent productivity for them.

I've moved into non-finance stuff for quite a while now though, so not sure what's become of it. Given business challenges in that particular sub-field, who knows..

stuntprogrammer · on March 13, 2017

Yep, he said while eyeballing a dashboard, there are plenty of larger private machines..

jacquesm · on March 13, 2017

I know this is not that site but I'd love a picture.

stuntprogrammer · on Jan 25, 2017

If the combination of such languages, high-performance hardware, and large scale compute problems is interesting.. the startup I work for in Mountain View is hiring...

hpcjoe · on Jan 26, 2017

:D Hope things are well by you!

stuntprogrammer · on Jan 26, 2017

And you too sir!

stuntprogrammer · on Dec 29, 2016

Indeed; as I've heard it ordered: 6 parts gin and a moment of silence for the vermouth.

stuntprogrammer · on Nov 29, 2016

Current publicly announced AVX512 does not support fp16. Skylake Server (SKX) and Knights Landing (KNL) are at a disadvantage here. They've not publicly said anything about extensions in Knights Hill (the long announced successor to KNL).

That said, Intel have announced the emergency "Knights Mill" processor jammed into the roundmap between KNL and Knights Hill. It's specifically targeted at deep learning workloads and one might expect FP16 support. They had a bullet point suggesting 'variable' precision too. I would guess that means Williamson style variable fixed point. (I also guess that the Nervena "flexpoint" is a trademarked variant of it).

I assume the FPGA inference card supports fp16. And Lake Crest (the first Nervena chip sampling next year) will support flex point of course. I would expect subsequent Xeon / Lake Crest successor integrations to do the same.

Fun times..

Aside on the compiler work -- I think it's not that hard to emit this instruction at least for GEMM style kernels where it's relatively obvious.

paulsutter · on Nov 29, 2016

Yes a compiler can generate the instruction. But if it's alone in a for loop surrounded by random STL classes - which even if inlined - are bodging up the pipeline or (gasp) causing spurious random dram accesses, there's little performance gain. And that's what usually happens in c++ code that wasn't already designed for AVX ("it's using AVX, but it's not running any faster. i guess AVX doesn't make much difference").

Net-net, data and code need to be structured for AVX to achieve the potential performance gains, and that's 80% of the work.

Once you structure the data and code for AVX, yes you can use regular C statements, then experiment with optimization flags until the compiler generates the intended instructions (and hasn't introduced excessive register spills). But its hard to see how that's any easier than using the intrinsics.

stuntprogrammer · on Nov 29, 2016

The problem is less the spurious DRAM accesses etc, as awful as they would be. The compiler problem is really a mix of 1) understanding enough about fixed-bound unit-stride loops to nonoverlapping memory (or transforming access to such) and 2) data layouts that prevent that. E.g. while there are well understood data layouts at each point of the compilation pipeline, it's hard in general for compilers to profitably shift from array of structs to struct of array layouts.

You are correct that, generally speaking, most STL heavy code would be hard to vectorize and unlikely to gain much advantage. (Plus there are the valarray misadventures). You will sometimes see clang and gcc vectorize std::vector if the code is simple enough, and they can assume strict aliasing. Intel's compiler has historically been less aggressive about assuming strict aliasing.

Various proposals are working through the standard committee to add explicit support for SIMD programming. E.g. if something like http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n418... were to be standardized we could write matrix multiply explicitly as:

  using SomeVec = Vector<T>
  for (size_t i=0; i<n; ++i) {
    for (size_t j=0; k<n; j+=SomeVec::size()) {
      SomeVec c_ij = A[i][0] * SomeVec(&B[0],j, Aligned);
      for (size_t k = 1; k < n; ++k) {
        c_ij += A[i][k] * SomeVec(&N[k][j], Aligned);
      }
      c_ij.store(&C[i][j], Aligned);
    }
  }

For my own work on vector languages and compilers I've had an easier time of it since they have been designed to enable simpler SIMD code generation.

stuntprogrammer · on Nov 28, 2016

An important detail is that this is not 130 petaflops of double precision floating point. Their target is >130 single or even half precision.

stuntprogrammer · on Nov 26, 2016

One tidbit is Google themselves saying they'll roll skylake out in cloud early next year.

https://cloudplatform.googleblog.com/2016/11/power-up-your-G...

stuntprogrammer · on Nov 13, 2016

I always recommend these:

https://scalableinformatics.com/assets/documents/Unison_Peta...

Surprisingly cost effective given the massive performance and support. Cloud is great for some things but not everything.

sijoe · on Nov 13, 2016

Thanks for the shout out! Without being a commercial, this high performance storage/analytics realm is what we focus on. We were demoing units like this: https://scalability.org/images/30GBps.png for years now. Insanely fast, tastes great, less filling. Our NVM versions are pretty awesome as well.

[edit]

I should point out that we build systems that the OP wants ... they likely don't know about us as we are a small company ...

https://scalableinformatics.com

and we do use gitlab

https://gitlab.scalableinformatics.com

...

Always happy to help ...

sytse · on Nov 13, 2016

Thanks for using GitLab and posting in https://gitlab.com/gitlab-com/infrastructure/issues/727#note... This looks really interesting.

stuntprogrammer · on Nov 8, 2016

Based on random sampling of those that contact me about such transitions, the latter is somewhat true. A couple have mentioned Julia but I wouldn't say it's the majority.

stuntprogrammer · on Oct 25, 2016