PostgreSQL Begins Adding LLVM JIT Support For Faster Performance (phoronix.com)
269 points by manigandham on April 7, 2018 | 28 comments



This has actually been fully merged since then (JIT-compiled expressions (projections, aggregates, etc.), tuple deforming, operator implementation inlining). There's a lot of further work to do, however (caching, non-crappy code for expressions, better planner logic).


Link to the (development version) docs: https://www.postgresql.org/docs/devel/static/jit.html


This is exciting. That said, it's important to note that this is unlikely to help much with most OLTP queries, which just look up information quickly via indexed data. Where this could really shine is CPU-bound queries performing complex operations, where the SQL interpreter is actually the bottleneck.
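For a concrete (made-up) illustration, a query like the following is dominated by per-row expression evaluation rather than index lookups, so it's the kind of thing JIT could speed up (table and columns are hypothetical):

  -- CPU-bound aggregate: most time is spent evaluating the
  -- arithmetic per row, not fetching data via indexes
  SELECT date_trunc('day', created_at) AS day,
         sum(price * quantity * (1 - discount)) AS revenue
  FROM orders
  GROUP BY 1;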

I wonder if AWS will incorporate these improvements into Redshift, which is a Postgres fork targeting OLAP use cases, where this is far more prevalent. In the past, they haven't done a great job at this (I ran into a float bug that had been fixed upstream 5+ years ago), but this may be a big enough value proposition to justify the effort.


> I wonder if AWS will incorporate these improvements into Redshift, which is a Postgres fork targeting OLAP use cases, where this is far more prevalent. In the past, they haven't done a great job at this (I ran into a float bug that had been fixed upstream 5+ years ago), but this may be a big enough value proposition to justify the effort.

Redshift is forked from ParAccel, which was itself originally forked from Postgres several years ago. ParAccel has had code generation (via transpilation to C) since long before Amazon even purchased its source code. In other words, Redshift has had this feature since its inception, albeit via C transpilation (compiled with GCC) rather than LLVM codegen.

In addition, Redshift's underlying storage is a column store, which is particularly well-suited to code generation (you can inline and even vectorize expressions across rows, without chasing pointers between rows). Postgres does not yet include a column store, but this piece of infrastructure is a big step in the right direction!


Oh, neat! I hadn't heard of ParAccel. Suppose I'm missing some history here.


> That said, it's important to note that this is unlikely to help much with most OLTP queries, which just look up information quickly via indexed data.

It'll definitely not help for that, at least for now. There's not yet any caching around JITed programs, which makes it completely unsuitable for OLTP. With caching (that requires some larger changes to postgres, which is why it won't be in v11), it's possible to observe some moderate gains.

> I wonder if AWS will incorporate these improvements into Redshift, which is a Postgres fork targeting OLAP use cases, where this is far more prevalent. In the past, they haven't done a great job at this (I ran into a float bug that had been fixed upstream 5+ years ago), but this may be a big enough value proposition to justify the effort.

I can't imagine them doing that. This uses a whole rewritten expression evaluation engine (introduced in v10) and a large number of infrastructure changes all over the executor (reimplementation of targetlist set-returning functions, projection, ...). Not impossible, but certainly not something done lightly.


  There's not yet any caching around JITed programs,
  which makes it completely unsuitable for OLTP.
Oh, I see. Thanks for clarifying! You mentioned it was merged. Does this mean it'll land in stable soon? Will it be behind a flag, or does the planner just know when to use it?

  I can't imagine them doing that.
RE: Redshift, that's very unfortunate :|. I suppose that's another downside of AWS forking vs., say, Citus, which moved to the extension model. Though this would really be perfect for Redshift's use cases... so I guess we'll see.

--

This is a step in the right direction regardless, but I'm curious how many people specifically use pure Postgres for large-scale OLAP workloads (not counting occasionally running one-off queries on their OLTP data).


> Oh, I see. Thanks for clarifying! You mentioned it was merged. Does this mean it'll land in stable soon?

It has been merged into the master branch, which'll become PostgreSQL 11. The first beta will be out in a few weeks, and the release is planned for September (early September, I'd hope).

> Will it be behind a flag, or does the planner just know when to use it?

There's cost-based logic that chooses it, yes. See https://www.postgresql.org/docs/devel/static/jit-decision.ht...

You can change those settings if you want to force JIT on or off.
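For reference, a minimal session-level sketch of those settings (names per the development docs; zero cost thresholds are deliberately extreme, to force JIT on):

  -- Force JIT regardless of the planner's cost estimates
  SET jit = on;
  SET jit_above_cost = 0;           -- JIT-compile even cheap queries
  SET jit_optimize_above_cost = 0;  -- also run the LLVM optimizer
  SET jit_inline_above_cost = 0;    -- also inline operator implementations

  -- Or turn it off entirely for the session
  SET jit = off;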


Isn't it against AWS's interest to boost base PostgreSQL, or to reduce their differentiation from it? I wish they contributed to PostgreSQL, but that would weaken the lock-in of their core offering.


Yes and no. Maybe yes for OLAP features, but not in general, because Redshift isn't fit for everything and they have plain RDS Postgres offerings. Also, my original post questioned whether they would bring improvements from Postgres core into their fork (Redshift)... in case you read it the other way around.

I know the RDS team has contributed some fixes and the like to Postgres, but I’m unsure of their overall significance.


> I know the RDS team has contributed some fixes and the like to Postgres, but I’m unsure of their overall significance

Negligible.


Anyone who's interested in this and related DBMS topics would be well served to check out the videos of Andy Pavlo's "Advanced Database Systems" course at CMU: https://www.youtube.com/watch?v=poEfLYH9W2M&list=PLSE8ODhjZX...


For what it's worth, MSSQL has been doing this for years. What you get in return is the chance to debug parameter sniffing and stale plans.


Databases have been generating machine code for things like sorts since the '60s. JIT-compiling code for queries is a very old approach.

Databases are a lot more interesting than they appear on the outside. They combine the disciplines of compilers (parsing, typing, optimization, codegen), operating systems (caching and memory management, concurrency, file systems) and distributed computing (replication, distributed transactions) all in one project.


MSSQL has been compiling plans to bytecode, but not in a form that can be directly or JIT-compiled to machine code. There is a big difference in compilation and runtime performance with an LLVM-based approach, where the generated IR can be.

That said, stale plans and parameter sniffing are problems with any system that optimizes and caches plans, regardless of whether the cached plans are compiled to bytecode or machine code (including Oracle, etc.).

The exception is Hekaton (MSSQL's in-memory database), which transpiles entire stored procedures to C (http://www.vldb.org/pvldb/vol7/p1095-nagel.pdf). This approach is very much optimized for in-memory OLTP, which is a different use case than Postgres's.


I do not think there should be any issues, since PostgreSQL already handles invalidation of plans when indexes and tables change. Adding JIT does not add any new hazard here.
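A minimal sketch of that existing behavior (hypothetical table t): prepared-statement plans are already invalidated and rebuilt when dependent objects change, so a JITed program could simply be discarded along with its plan.

  PREPARE q AS SELECT count(*) FROM t WHERE a > 10;
  EXECUTE q;              -- plan built (and potentially JITed) here
  CREATE INDEX ON t (a);  -- DDL on t invalidates the cached plan
  EXECUTE q;              -- next execution replans automatically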


This is really cool. I've often wondered why queries weren't compiled to native code to remove interpreter overhead. I imagine that overhead must not have been particularly low-hanging fruit until recently.


> I've often wondered why queries weren't compiled to native code to remove interpreter overhead.

It's not entire queries that are JITed at this point, just a subset of the work. WHERE clauses are, but the logic around joins, just as an example, is not. Partially because it's a lot of work to do more, and partially because it's often not very beneficial: you can end up spending more time emitting and optimizing JITed code than you save.
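If you're curious whether JIT kicked in for a particular query, EXPLAIN ANALYZE includes a JIT summary (how many functions were emitted, time spent generating, inlining, optimizing, and emitting them) when it was used; a quick sketch against a hypothetical table:

  -- A JIT section at the bottom of the output shows what was
  -- compiled and how long codegen took (exact format may change)
  EXPLAIN (ANALYZE) SELECT sum(a * b) FROM big_table WHERE a > 42;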

> I imagine that overhead must not have been particularly low-hanging fruit until recently.

I think it's largely because it's a huge project. I've spent a good chunk of two release cycles (a year each) on it. Not that easy to get an employer (or two, as in my case) to sponsor work of that size. Thankfully both Citus and EnterpriseDB were willing to do so.

Hopefully now that the infrastructure is in we can JIT more things in smaller patches.


Actually, the performance benefits can be huge (depending on the design) and have been for a long time. The reasons have more to do with the complexity and difficulty of implementation. The query compilation and execution engine must preserve the operational robustness, integrity, and behavior of the database engine, which is not a trivial design and implementation problem. The number of things that can go horribly wrong increases significantly versus an interpreter.

JIT query compilers were relatively common in closed source databases even a decade ago; it was one of their technical advantages versus open source. This is a necessary step toward closing that gap. There is a similar gap, with a similar level of impact, around storage/execution schedulers, which are currently lacking in open source implementations.


As an example of something that can go wrong: Apache Impala uses LLVM to JIT queries. It turns out some code optimization algorithms are O(n^3) or worse, where n is small enough not to matter too much with human-written code, but things fall apart with a few thousand columns in a single table, which is not super unusual for analytics in an environment with expensive joins like Hadoop.


Does this actually improve the performance of most queries? Most queries are light on computation and heavy on IO. It seems kind of like a waste of effort, but maybe there are people with very complex / compute-heavy queries.


> Does this actually improve the performance of most queries?

"most queries" - probably not. By sheer number that's going to be OLTP queries, and there it doesn't help. You need analytics queries that take upwards of 0.2s or such to benefit.

> Most queries are light on computation and heavy on IO.

That's something that's often said. I think it's definitely wrong today, and has been for a while. In a lot of workloads a good chunk of the hot data set is in memory, and even a single decent SSD can often more than saturate a single core.

If you look at analytics benchmarks and real-world analytics usage, you'll often see CPU being the bottleneck. Using multiple cores can alleviate that to some degree, but that can imply a need for bigger hardware / less concurrency, and it doesn't come for free. Efficiency is important.


Depends on the workload and the database. Many workloads on modern hardware are more often limited by memory and network bandwidth than by I/O per se. (This partly depends on database kernel architecture and presumes a modern design; some database engines, particularly older designs, may have other bottlenecks.) Also, for scale-out databases you can compile the query once and ship the binary or IR to every relevant node, which amortizes the compilation cost.

For classic OLTP workloads, JIT compilation doesn't carry much benefit because you are (typically) directly targeting a few records. The primary benefit is for constraint matching in page scan operators. I would say the sweet spot is for low-latency operational analytics i.e. queries in mixed workload environments; the average dwell time of a query on a particular page has a big impact. These workloads (basically "real-time analytics") are increasingly popular.

Before I had experience designing systems with JIT-compiled queries, I worried quite a bit about the overhead. The people with expertise who assured me the overhead would be pretty small if implemented well turned out to be correct, and that allows you to JIT a lot of operations for material benefit.


Hope this will happen for plpgsql one day...


Very odd use of the word "landing".


"Landing a feature" is common way of phrasing that a feature has been added. For example, one might say that "C++17 support has landed in master", referring to the master branch for LLVM.



OK, we'll s/land/add/ the title above.



