Amazing overview. Right now I'm working on a completely new OLAP database: vectorized query execution is a must-have nowadays if you want competitive performance. What's interesting is that vectorized query engines benefit a lot from SIMD instructions (that much is obvious). What's less obvious is that the rising popularity of ARM CPUs has made it more complex to develop a vectorized query engine, because one has to provide support for both architectures.
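Roughly, the dual support ends up looking like this: the same kernel written twice behind preprocessor guards, plus a scalar fallback. A minimal illustrative sketch (sum_i32 and its layout are made up for this comment, not our actual code):

    #include <stdint.h>
    #include <stddef.h>

    #if defined(__AVX2__)
    #include <immintrin.h>
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    #endif

    /* Sum a column of int32 values; lane overflow handling omitted for brevity. */
    int64_t sum_i32(const int32_t *v, size_t n) {
        size_t i = 0;
        int64_t acc = 0;
    #if defined(__AVX2__)
        __m256i vacc = _mm256_setzero_si256();               /* 8 x int32 lanes */
        for (; i + 8 <= n; i += 8)
            vacc = _mm256_add_epi32(vacc,
                       _mm256_loadu_si256((const __m256i *)(v + i)));
        int32_t tmp[8];
        _mm256_storeu_si256((__m256i *)tmp, vacc);
        for (int k = 0; k < 8; k++) acc += tmp[k];           /* horizontal add */
    #elif defined(__ARM_NEON)
        int32x4_t vacc = vdupq_n_s32(0);                     /* 4 x int32 lanes */
        for (; i + 4 <= n; i += 4)
            vacc = vaddq_s32(vacc, vld1q_s32(v + i));
        acc += vaddvq_s32(vacc);                             /* horizontal add (AArch64) */
    #endif
        for (; i < n; i++)                                   /* scalar tail / fallback */
            acc += v[i];
        return acc;
    }

Multiply that by every operator you want to keep fast, and the second architecture stops being free.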
> because one has to provide support for both architectures
You aren't using LLVM IR then? I've found that if you're a bit careful, LLVM is perfectly capable of lowering IR to vector instructions on ARM/x86 automatically.
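For example, something as plain as this, compiled with clang at -O3, usually gets auto-vectorized from the same IR into SSE/AVX on x86-64 and NEON on AArch64, with no intrinsics in the source (toy example, real kernels are obviously messier):

    #include <stddef.h>

    /* The restrict qualifiers tell the optimizer the arrays don't alias,
     * which is what lets the loop vectorizer kick in cleanly. */
    void mul_add(float *restrict out, const float *restrict a,
                 const float *restrict b, float c, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] * b[i] + c;
    }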
idk why you would use a virtual machine. clustrix did for a while until the instruction dispatch loop dominated profiling. then one particularly clever and energetic engineer just lowered onto x86
I did the switch from a tree walk interpreter to a VM in postgres - it did lead to substantial speedups.
I've experimented with lowering into x86 "manually" as well, and it did provide gains, but the portability aspects are concerning. And for much of the desirable speedup, IME you want more optimization than trivial lowering would give you.
Either way, without the tree walk -> VM move, it'd have been much harder to lower into x86 directly. And you'd still need a fallback for other platforms.
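For anyone wondering what the tree walk -> VM move looks like in the abstract: flatten the expression tree once into a linear opcode program, then run a tight dispatch loop over it per row instead of recursing through nodes. A toy sketch with made-up opcodes (nothing like the actual postgres code):

    #include <stddef.h>

    typedef enum { OP_CONST, OP_VAR, OP_ADD, OP_MUL, OP_DONE } OpCode;

    typedef struct {
        OpCode op;
        double constval;   /* used by OP_CONST */
        int    varindex;   /* used by OP_VAR */
    } Instr;

    /* Evaluate a flattened expression program against one row of input. */
    double eval(const Instr *prog, const double *row) {
        double stack[64];
        size_t sp = 0;
        for (const Instr *ip = prog;; ip++) {
            switch (ip->op) {
            case OP_CONST: stack[sp++] = ip->constval;       break;
            case OP_VAR:   stack[sp++] = row[ip->varindex];  break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:   sp--; stack[sp - 1] *= stack[sp]; break;
            case OP_DONE:  return stack[sp - 1];
            }
        }
    }

So a*2 + b becomes {VAR 0, CONST 2, MUL, VAR 1, ADD, DONE}, and the per-node function calls and pointer chasing of the tree walk disappear.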
A recent blog post about MongoDB's SBE explained that the purpose of their VM is to let a single execution component be re-used between two different query languages. SQLite's claim was just that the strong separation between compilation and execution makes issues easier to debug.
I wouldn't expect VMs to become the default design in databases, but it seems like it's getting increasingly common as an IR for query compilation. The ability to have a (comparatively simpler) interpreter for the VM also means you can apply simple fuzzing to great effect: if the results of interpretation vs compilation ever diverge, there's a bug.
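The differential harness itself can be tiny; something along these lines, where the two function pointers are stand-ins for the interpreter and compiled backends (not any real engine's API):

    #include <assert.h>
    #include <math.h>

    /* One evaluator per backend; both take a row and produce a value. */
    typedef double (*ExprEval)(const double *row);

    static void check_one(ExprEval interp, ExprEval compiled, const double *row) {
        double a = interp(row);       /* bytecode interpreter */
        double b = compiled(row);     /* compiled version of the same expression */
        assert(fabs(a - b) < 1e-9);   /* any divergence means one backend has a bug */
    }

Feed it randomly generated expressions and rows and you have the simple fuzzing setup described above.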
It would surprise me to see the dispatch loop dominate in a DB, given that on the whole databases tend to block a lot on more expensive things like materialization, joins, etc.
I do think that a VM could mess up CPU branch prediction, so I can see that being a problem.
Some people have gone the route of emitting LLVM IR (or similar) for query plans, but this has a cost, too, so only makes sense for repeatedly used queries.
It really depends on the query. If there's a bunch of math in the select list, it's easy for that to be the biggest bottleneck, even if there are joins etc. Also you'll sometimes have to evaluate expressions to check join conditions depending on the join implementation and condition.
> I do think that a VM could mess up CPU branch prediction so I can see that as being a problem
It definitely happens. It can be partially addressed with things like computed gotos, but it's far from perfect. At least for postgres, it's just that the tree walk overhead was far higher.
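For reference, computed-goto dispatch looks roughly like this (GCC/Clang labels-as-values extension, made-up opcodes): each handler ends in its own indirect jump, so the predictor sees one branch per opcode site instead of a single shared switch branch:

    #include <stddef.h>

    typedef enum { OP_CONST, OP_ADD, OP_DONE } OpCode;
    typedef struct { OpCode op; double constval; } Instr;

    double eval_cgoto(const Instr *prog) {
        /* &&label takes the address of a label (GCC/Clang extension). */
        static void *dispatch[] = { &&do_const, &&do_add, &&do_done };
        double stack[64];
        size_t sp = 0;
        const Instr *ip = prog;

    #define NEXT() goto *dispatch[ip->op]
        NEXT();
    do_const: stack[sp++] = ip->constval;        ip++; NEXT();
    do_add:   sp--; stack[sp - 1] += stack[sp];  ip++; NEXT();
    do_done:  return stack[sp - 1];
    #undef NEXT
    }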
The branch prediction handling for opcode dispatch loops definitely improved in CPUs - for Intel somewhere around Haswell (it's been a while, so this is a guess). I remember that, at the time, the gains of using computed goto on my aging workstation (nehalem) were far bigger than a few uarchs later, even when controlling for cache size etc.
Semi-related anecdote: both opcode dispatch switches and computed-goto-based dispatch are good for stressing compilers. I've found newly introduced >= quadratic behavior in all the major C compilers in the years since.
clearly not in I/O bound cases, but in that world everything except the b-tree was done relationally, so yeah, joins and views and the whole lot. we may have also had a really poorly thought out VM. but direct instruction generation wasn't really a big deal and it helped quite a bit