
Show HN: Speeding up PostgreSQL through vectorized execution - canguler
https://github.com/citusdata/postgres_vectorization_test
======
drob
I'm always surprised at how much room there is for big performance
improvements, even in a system as mature as PostgreSQL. Evaluating an agg node
is a pretty core piece of functionality, and there's still room for a pretty
big win here.

PostgreSQL is powerful and feature-rich, and there's still so much work to be
done. (Consider that index-only scans themselves are new as of 9.2!)

Very cool project.

~~~
yangyang
If you read through it, this is a change to the Citus cstore_fdw extension,
which stores data outside PostgreSQL and queries it using the PostgreSQL
Foreign Data Wrapper API (see
[https://wiki.postgresql.org/wiki/Foreign_data_wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)).

So it's impressive, but it isn't a performance improvement for core
PostgreSQL, or data stored natively in PostgreSQL (in particular, cstore_fdw
is not very flexible - from the README for that project: "Note. We currently
don't support updating table using INSERT, DELETE, and UPDATE commands.")

~~~
drob
I did see that, although it's plausible that an analogous change would be
helpful in core postgres. The readme seemed to suggest that they tried this on
cstore_fdw because it was easier to develop on, not because there was more
low-hanging fruit.

~~~
yangyang
It sounds like it was precisely because there was more low-hanging fruit.

I think column-store databases are far more amenable to vectorised processing.
I'm not aware of any row-store databases which do it, and most research into
it has been on column-store (or hybrid) systems.

------
yangyang
There are some other interesting PostgreSQL FDW extensions targeting speed.

PGStrom uses the GPU:
[https://wiki.postgresql.org/wiki/PGStrom](https://wiki.postgresql.org/wiki/PGStrom)

PGOpenCL:
[http://wiki.postgresql.org/images/6/65/Pgopencl.pdf](http://wiki.postgresql.org/images/6/65/Pgopencl.pdf)

PostgreSQL excels at heavy OLTP-type workloads; the MVCC architecture means
that readers don't block writers, and writers don't block readers. But there
is quite a lot of overhead for every row, and the query architecture is tuple-
at-a-time, with lots of function calls. This means that OLAP workloads can be
slow, compared to column store DBs like Vertica, MonetDB etc.

~~~
skyebook
Did PGOpenCL ever get released in any form? I've seen that PDF over and over
again, but never any product or code.

------
adamtj
I'm not sure that I understand the big picture here. Is my summary correct?

This optimization does not apply to Postgres in general, but only to
cstore_fdw and other column-store foreign data wrappers.

The Postgres execution engine was designed to operate on a row store. Even so,
it can operate on a column store via a foreign data wrapper like cstore_fdw.
However, to do so it must currently reconstruct rows one at a time to fit its
row-oriented nature. As a result, it cannot currently realize the potential
performance benefits of using a column store for certain queries.

This article is about an extension that adds some new column-at-a-time
aggregate functions, then hooks into the Postgres executor to modify eligible
query plans to use them. It thus enables Postgres to take advantage of the
column-oriented nature of cstore_fdw.

Is that an accurate way to look at it?

~~~
ozgune
Yes, it is. Can started by trying these ideas on Postgres' row store, and he
thought we could get good benefits, particularly when the columns in a table
had fixed widths.

Then again, making changes for both fast column projections and aggregations
in Postgres didn't look too easy. Since we told him to go crazy in a short
amount of time, the best way to do that seemed to be through the columnar
store.

------
mahmud
Impressive work for any engineer, but an _intern_!? Amazing.

~~~
alecco
They implemented ideas from a VectorWise paper. It's very good for an intern,
but it's not original work.

~~~
ozgune
(Ozgun from Citus Data here.)

I updated the Readme and added a second link to the original vectorization
paper. We found that most of the benefits came from simply switching to batch
processing; that's why we referenced the more recent "vectorization vs LLVM"
paper.

I'd say the original part of Can's work is showing that these ideas can be
applied to PostgreSQL's robust and optimized executor without touching any
core logic.
In that sense, it could also be relevant to this earlier thread on cstore:
[https://news.ycombinator.com/item?id=7524886](https://news.ycombinator.com/item?id=7524886)

~~~
alecco
Yeah, I remember being downvoted without explanation.

~~~
ozgune
For clarification, it wasn't anyone from our team who downvoted any of these
threads.

All we can do on our part is to keep working on PostgreSQL to make it even
better and faster, and to share our work with others in the community.

------
Rapzid
@OP: A really good read and performance find. I wouldn't have expected those
kinds of gains, but then I hadn't considered the impact on CPU optimization.
The improvements are fantastic! Could you speak a bit about how the vectorized
approach benefits more from CPU features? Does it work better with the cache?

Aside: This is a good concept to keep in mind when using channels in Go as
well. When there's overhead to moving a single unit of work through a process,
and you're doing MANY of these shipments, there are savings to be had from
making your unit of work large (a vector/slice of items vs. a single item).

~~~
jandrewrogers
Vectorization shifts the performance tradeoffs in the storage model. As a
practical matter, users design their applications around the performance
tradeoffs of database engines, so you really can't change your storage model
without alienating users.

There are roughly four different storage models in databases, with their own
acronyms: NSM (pure row structured storage), DSM (sorted columnar structured
storage), PAX (like DSM but where a row is stored within a single page), VSM
(array structured individual columns where every element in a row has the same
index). PAX and VSM are hybrid storage models. All four are used in real
commercial systems. Vector storage models (VSM) are extremely good for real-
time workloads; they sacrifice a little bit of insert performance for a major
improvement in query performance relative to NSM, but their insert/append
performance is much better than DSM or (to a lesser extent) PAX.

The reasons vector models are extremely efficient in terms of query processing
boil down to a few facts. First, a column can be streamed into the CPU as a
sequential memory scan, which is very efficient. Second, constraints on
multiple columns can be trivially parallelized. Third, searches across
multiple columns/attributes can usually be constructed as a bitmap per column,
and then composed as simple AND/OR operations over those bitmaps, which is
extremely fast. Fourth, individual columns in this model are amenable to the
use of vector instructions for evaluation in the CPU, saving even more clock
cycles.

There is no ideal storage model, it really depends on the workload. Some
sophisticated databases will change the storage model to adapt to the likely
access patterns, such as NSM (great for a page with a lot of inserts) to PAX
or VSM (great for queries when a page is unlikely to change much).

~~~
ntoshev
The cstore_fdw extension uses the Optimized Row Columnar (ORC) format, which
seems to be PAX in your nomenclature. The improvements he accomplished seem
to be related just to the first "fact" you listed, which applies to all
column stores.

~~~
jandrewrogers
I am not familiar with that particular implementation. PAX is not my
nomenclature; it is what sorted in-page columnar formats are called in the
literature (it is what the first paper describing them called them).

Not all columnar formats have great CPU streaming performance, and even then
it is data-type dependent. Many columnar formats have significantly worse
streaming performance than vector formats because only the latter is primarily
optimized for it, and column stores will trade that for other things (like
format compressibility). And for data types that have no mathematical order of
any kind, e.g. interval data types, columnar formats effectively degenerate
into very inefficient vector formats. It is still an open academic discussion.

------
clemsen
Great article. I am not a real programmer (but a regular user of Postgres for
scientific data storage) and don't get every detail. But do these changes
have a chance of being implemented in a future version of Postgres, or are
these improvements tied to CitusDB?

~~~
yangyang
I can't imagine this would ever get into core Postgres. It's a very specific
extension.

------
lee
So what's the likelihood that we'll see this optimization make it into
PostgreSQL core?

That would be an amazing performance improvement.

------
MHammond
Wonder if these improvements work in tandem with the in-memory columnar store
(IMCS) contributed by Konstantin Knizhnik.

~~~
yangyang
IMCS [1] already uses SIMD vector instructions.

[1]
[http://www.pgcon.org/2014/schedule/attachments/322_IMCS.pdf](http://www.pgcon.org/2014/schedule/attachments/322_IMCS.pdf)

------
ceyhunkazel
Well done, Can! Keep up the good work.

~~~
bronson
Agreed, that's a great report. Your results sound encouraging; it definitely
seems like a good place to find traction.

------
jedisct1
Very nice.

------
captainbenises
Cool post. I don't think this is a Show HN though.

~~~
DrJokepu
Are you kidding? Articles like this are my favourite posts on HN. I would love
to see more technical content like this and less softcore chit-chat.

