Any comments on performance, today or in the near future? Are there planned features that should provide a big speedup over competitors (kdb, pandas)?


I've primarily been focused on the ergonomics of the language, so I've only tried to make performance "reasonable" for now.

Longer-term performance objectives are:

1. JIT - I designed the VM's bytecode to be both directly interpretable and usable as a mid-level IR for lowering to LLVM. Currently I just interpret everything, since there is almost no runtime overhead for vector operations. Compiled code, however, will greatly speed up scalar code in loops. (There's a toy sketch of the opcode design after this list.)

2. SIMD - Since the VM's opcodes are already statically typed and vector-aware, integrating OpenBLAS and SLEEF (or Intel's MKL and VML) should be straightforward; see the lowering sketch below.

3. MIMD - Ideally I can just lean on existing libraries, though I'm not above embedding OpenMP if that gets the job done (sketch below).

4. Distributed - Now comes the hard part. If we want MPI-level performance, I need to have more sophisticated scheduling. Which leads us to...

5. Streaming - This is the real holy grail. There has been a ton of research in the database community on getting away from the "Volcano model" (iterators). I want the compiler to generate streaming-aware opcodes for the VM based on how the data will be consumed; a toy illustration follows below. I believe this will require a type system that can track the "context" of a computation, similar to how Koka and F* track side effects. I'm not aware of any general-purpose language that has compiled streaming.
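
To make (1) concrete, here's a toy sketch of what I mean by a statically typed, vector-aware opcode. The names are hypothetical, not how interpret.cpp actually lays things out:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical layout: each instruction is statically typed and
    // operates on whole vectors held in register slots.
    enum class Op : std::uint8_t { AddF64, MulF64 };

    struct Instr {
        Op op;
        std::uint32_t dst, lhs, rhs;   // register slots
    };

    using Vec = std::vector<double>;

    void interpret(const std::vector<Instr>& code, std::vector<Vec>& regs) {
        for (const Instr& in : code) {
            const Vec& a = regs[in.lhs];
            const Vec& b = regs[in.rhs];
            Vec& d = regs[in.dst];
            d.resize(a.size());
            switch (in.op) {
            case Op::AddF64:
                // One dispatch amortized over the whole vector: this is
                // why plain interpretation is cheap for vector ops.
                for (std::size_t k = 0; k < a.size(); ++k) d[k] = a[k] + b[k];
                break;
            case Op::MulF64:
                for (std::size_t k = 0; k < a.size(); ++k) d[k] = a[k] * b[k];
                break;
            }
        }
    }

Because each instruction already carries its types and shapes, a JIT pass can lower an opcode to LLVM IR without any type recovery; it's the per-instruction dispatch on scalars in a loop that compilation actually rescues.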
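For (2), the lowering is mostly mechanical. A sketch of routing two vector opcodes through cblas (assumes OpenBLAS headers and linking with -lopenblas; SLEEF or VML would cover the transcendental opcodes the same way):

    #include <cstddef>
    #include <vector>
    #include <cblas.h>   // OpenBLAS (or MKL's cblas interface)

    using Vec = std::vector<double>;

    // dst += alpha * src  ->  one BLAS level-1 call instead of a scalar loop
    void op_axpy(double alpha, const Vec& src, Vec& dst) {
        cblas_daxpy(static_cast<int>(dst.size()), alpha,
                    src.data(), 1, dst.data(), 1);
    }

    // dst *= alpha  ->  cblas_dscal
    void op_scale(double alpha, Vec& dst) {
        cblas_dscal(static_cast<int>(dst.size()), alpha, dst.data(), 1);
    }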
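And (3) can start as literally one pragma per vector loop, since the opcodes have no cross-iteration dependencies (compile with -fopenmp):

    #include <cstddef>
    #include <vector>

    // The inner loop of a vector opcode, parallelized across cores.
    // Every iteration is independent, so a bare parallel-for is safe.
    void add_f64(const std::vector<double>& a, const std::vector<double>& b,
                 std::vector<double>& d) {
        d.resize(a.size());
        #pragma omp parallel for
        for (std::ptrdiff_t k = 0;
             k < static_cast<std::ptrdiff_t>(a.size()); ++k)
            d[k] = a[k] + b[k];
    }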
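For (5), a toy illustration of the target shape: instead of chaining Volcano-style next() calls per element, the compiler would fuse a filter/map/reduce opcode pipeline into a single push-based loop, so each value flows through the whole pipeline while still hot in a register (hand-written here; the point is that codegen would emit this shape):

    #include <cstdio>
    #include <vector>

    // Push-based pipeline: filter -> map -> sum fused into one loop,
    // with no intermediate vectors materialized.
    int main() {
        std::vector<double> xs = {1, -2, 3, -4, 5};
        double sum = 0.0;
        for (double x : xs) {        // producer pushes...
            if (x > 0) {             // ...through the filter...
                double y = x * x;    // ...the map...
                sum += y;            // ...into the sink.
            }
        }
        std::printf("%f\n", sum);    // 35.000000
    }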


Looking at interpret.cpp for SIMD potential: I bet you could add an allocator for std::vector that aligns and pads everything to 32 bytes, then just replace all of the scalar op loops with loops over AVX intrinsics. No need for an external library.
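
Roughly this shape (untested sketch; assumes C++17 for aligned operator new, and -mavx):

    #include <cstddef>
    #include <immintrin.h>
    #include <new>
    #include <vector>

    // Minimal 32-byte-aligned allocator so vector.data() is safe for
    // aligned AVX loads/stores.
    template <class T>
    struct AlignedAlloc {
        using value_type = T;
        AlignedAlloc() = default;
        template <class U> AlignedAlloc(const AlignedAlloc<U>&) {}
        T* allocate(std::size_t n) {
            return static_cast<T*>(::operator new(n * sizeof(T),
                                                  std::align_val_t(32)));
        }
        void deallocate(T* p, std::size_t) {
            ::operator delete(p, std::align_val_t(32));
        }
    };
    template <class T, class U>
    bool operator==(const AlignedAlloc<T>&, const AlignedAlloc<U>&) { return true; }
    template <class T, class U>
    bool operator!=(const AlignedAlloc<T>&, const AlignedAlloc<U>&) { return false; }

    using AVec = std::vector<double, AlignedAlloc<double>>;

    // Scalar add loop replaced by 4-wide AVX; the scalar tail disappears
    // if sizes are padded to a multiple of 4 as suggested.
    void add_avx(const AVec& a, const AVec& b, AVec& d) {
        std::size_t n = a.size(), k = 0;
        d.resize(n);
        for (; k + 4 <= n; k += 4) {
            __m256d va = _mm256_load_pd(a.data() + k);
            __m256d vb = _mm256_load_pd(b.data() + k);
            _mm256_store_pd(d.data() + k, _mm256_add_pd(va, vb));
        }
        for (; k < n; ++k) d[k] = a[k] + b[k];   // tail
    }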


That's a possibility for getting something running in the near term. I'm trying to avoid CPU-specific intrinsics, since I have a fantasy that this might run on ARM some day, though that may be getting ahead of myself.


NEON intrinsics are pretty easy as well ;) As long as you are doing simple +-*&| ops, they work the same as SSE.
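
E.g. the same add loop on AArch64 (float64x2_t is 64-bit ARM only, and you get two doubles per op instead of four):

    #include <arm_neon.h>
    #include <cstddef>

    // NEON version of the vector add: 2-wide f64 lanes, scalar tail.
    void add_neon(const double* a, const double* b, double* d, std::size_t n) {
        std::size_t k = 0;
        for (; k + 2 <= n; k += 2) {
            float64x2_t va = vld1q_f64(a + k);
            float64x2_t vb = vld1q_f64(b + k);
            vst1q_f64(d + k, vaddq_f64(va, vb));
        }
        for (; k < n; ++k) d[k] = a[k] + b[k];   // tail
    }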

