I wonder if AWS will incorporate these improvements into Redshift, which is Postgres forked for OLAP use cases where this is far more prevalent. In the past, they haven't done a great job at this (ran into a float bug fixed upstream 5+ years ago), but this may be a big enough value proposition to justify the effort.
Redshift is forked from ParAccel, which was originally forked from Postgres several years ago. ParAccel has had code generation (via transpilation to C) since long before Amazon even purchased its source code. In other words, Redshift has had this feature since its inception, albeit via C-transpilation (compiled with GCC) not LLVM codegen.
In addition, Redshift's primary index is a column store which is particularly well-suited to code generation (you can inline and even vectorize expressions across rows, without hopping pointers in between rows). Postgres does not yet include a column store, but this piece of infrastructure is a big step in the right direction!
It definitely won't help for that, at least for now. There's no caching of JITed programs yet, which makes JIT compilation completely unsuitable for OLTP. With caching (which requires some larger changes to Postgres, which is why it won't be in v11), it should be possible to observe some moderate gains.
> I wonder if AWS will incorporate these improvements into Redshift, which is Postgres forked for OLAP use cases where this is far more prevalent. In the past, they haven't done a great job at this (ran into a float bug fixed upstream 5+ years ago), but this may be a big enough value proposition to justify the effort.
I can't imagine them doing that. This builds on a wholly rewritten expression evaluation engine (introduced in v10) and a large number of infrastructure changes all over the executor (reimplementation of targetlist set returning functions, projection, ...). Not impossible, but certainly not something lightly done.
> There's not yet any caching around JITed programs, which makes it completely unsuitable for OLTP.

> I can't imagine them doing that.
This is a step in the right direction regardless, but I'm curious how many people are specifically using pure Postgres for large-scale OLAP models (not including occasionally running one-off queries on their OLTP data).
It has been merged into the master branch, which'll become PostgreSQL 11. First beta will be in a few weeks. Release is planned to be in September (early I'd hope).
> Will it be behind a flag, or does the planner just know when to use it?
There's cost-based logic that chooses it, yes. See https://www.postgresql.org/docs/devel/static/jit-decision.ht...
You can change those if you want to force JIT on/off.
I know the RDS team has contributed some fixes and the like to Postgres, but I’m unsure of their overall significance.
Databases are a lot more interesting than they appear on the outside. They combine the disciplines of compilers (parsing, typing, optimization, codegen), operating systems (caching and memory management, concurrency, file systems) and distributed computing (replication, distributed transactions) all in one project.
That said, stale plans and parameter sniffing are problems with any system that optimizes and caches plans, regardless of whether the cached plans are compiled to bytecode or machine code (including Oracle, etc.).
The exception is Hekaton (MSSQL's in-memory database), which transpiles entire stored procedures to C (http://www.vldb.org/pvldb/vol7/p1095-nagel.pdf). This approach is very much optimized for in-memory OLTP, which is a different use case than Postgres's.
It's not entire queries that are JITed at this point, it's just a subset of the work. WHERE clauses are, but the logic around joins, just as an example, is not. Partially because it's a lot of work to do more, partially because it's often not very beneficial and you end up spending more time emitting & optimizing JITed code than you save.
> I imagine that overhead must not have been particularly low hanging fruit until recently.
I think it's largely because it's a huge project. I've spent a good chunk of two release cycles (a year each) on it. Not that easy to get an employer (or two as in my case) to sponsor work of that size. Thankfully both Citus and EnterpriseDB were willing to do so.
Hopefully now that the infrastructure is in we can JIT more things in smaller patches.
JIT query compilers were relatively common in closed source databases even a decade ago; it was one of their technical advantages over open source. This is a necessary step toward closing that gap. There is a similar gap, with a similar level of impact, around storage/execution schedulers, which are currently lacking in open source implementations.
"most queries" - probably not. By sheer number that's going to be OLTP queries, and there it doesn't help. You need analytics queries that take upwards of 0.2s or such to benefit.
> Most queries are light on computation and heavy in IO.
That's something often said. I think it's definitely wrong today and has been for a while. In a lot of workloads a good chunk of the hot data set is in memory, and even a single decent SSD can often more than saturate a single core.
If you look at analytics benchmarks and real world analytics usage, you'll often see CPU being the bottleneck. Using multiple cores can alleviate that to some degree, but that can imply a need for bigger hardware / less concurrency, and doesn't come for free. Efficiency is important.
For classic OLTP workloads, JIT compilation doesn't carry much benefit because you are (typically) directly targeting a few records. The primary benefit is for constraint matching in page scan operators. I would say the sweet spot is for low-latency operational analytics i.e. queries in mixed workload environments; the average dwell time of a query on a particular page has a big impact. These workloads (basically "real-time analytics") are increasingly popular.
Before I had experience designing systems with JIT-compiled queries, I worried quite a bit about compilation overhead. The experts who assured me the overhead would be pretty small if implemented well turned out to be correct, and that small overhead lets you JIT a lot of operations for material benefit.
It's pretty common.