Show HN: Speeding up PostgreSQL through vectorized execution (github.com/citusdata)
332 points by canguler on Oct 7, 2014 | 36 comments



I'm always surprised at how much room there is for big performance improvements, even in a system as mature as PostgreSQL. Evaluating an agg node is a pretty core piece of functionality, and there's still room for a pretty big win here.

PostgreSQL is powerful and feature-rich, and there's still so much work to be done. (Consider that index-only scans themselves are new as of 9.2!)

Very cool project.


If you read through it, this is a change to the Citus cstore_fdw extension, which stores data outside PostgreSQL and queries it using the PostgreSQL Foreign Data Wrapper API (see https://wiki.postgresql.org/wiki/Foreign_data_wrappers).

So it's impressive, but it isn't a performance improvement for core PostgreSQL, or data stored natively in PostgreSQL (in particular, cstore_fdw is not very flexible - from the README for that project: "Note. We currently don't support updating table using INSERT, DELETE, and UPDATE commands.")


I did see that, although it's plausible that an analogous change would be helpful in core Postgres. The readme seemed to suggest that they tried this on cstore_fdw because it was easier to develop on, not because there was more low-hanging fruit.


It sounds like it was precisely because there was more low-hanging fruit.

I think column-store databases are far more amenable to vectorised processing. I'm not aware of any row-store databases which do it, and most research into it has been on column-store (or hybrid) systems.


It's not so much of a surprise for Postgres; after all, the team only started focusing on performance in the 7.x or 8.x cycle. Before that, Postgres was known for a "correctness first, speed… maybe?" approach.


Well, you know, until recently, 'my data fits in RAM' just didn't happen at a scale worth optimizing for.


Sure, but that itself is interesting: context around data systems is changing so fast that even the mature ones have to innovate pretty quickly to keep up. (Or maybe especially the mature ones?)

I wonder how long before postgres includes an index type optimized for in-memory workloads, e.g. a skiplist.


There are some other interesting PostgreSQL FDW extensions targeting speed.

PGStrom uses the GPU: https://wiki.postgresql.org/wiki/PGStrom

PGOpenCL: http://wiki.postgresql.org/images/6/65/Pgopencl.pdf

PostgreSQL excels at heavy OLTP-type workloads; the MVCC architecture means that readers don't block writers, and writers don't block readers. But there is quite a lot of overhead for every row, and the query architecture is tuple-at-a-time, with lots of function calls. This means that OLAP workloads can be slow compared to column-store DBs like Vertica, MonetDB, etc.
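
To make the tuple-at-a-time point concrete, here is a minimal Go sketch (Go rather than the executor's actual C, with made-up function names standing in for the real iterator interface): the first version pays one indirect function call per row, while the second amortizes the call overhead over a whole batch with a tight, cache-friendly inner loop.

    package main

    import "fmt"

    // rows stands in for one table column of int64 values.
    var rows = make([]int64, 1_000_000)

    func init() {
        for i := range rows {
            rows[i] = int64(i % 100)
        }
    }

    // sumTupleAtATime models the iterator ("Volcano") executor:
    // one indirect call per row, so call overhead dominates an
    // operation as cheap as addition.
    func sumTupleAtATime() int64 {
        i := 0
        next := func() (int64, bool) {
            if i >= len(rows) {
                return 0, false
            }
            v := rows[i]
            i++
            return v, true
        }
        var total int64
        for v, ok := next(); ok; v, ok = next() {
            total += v
        }
        return total
    }

    // sumVectorized models batch-at-a-time execution: one call per
    // batch, with a tight loop over a contiguous slice that the
    // compiler and the CPU cache handle well.
    func sumVectorized(batchSize int) int64 {
        var total int64
        for off := 0; off < len(rows); off += batchSize {
            end := off + batchSize
            if end > len(rows) {
                end = len(rows)
            }
            for _, v := range rows[off:end] {
                total += v
            }
        }
        return total
    }

    func main() {
        fmt.Println(sumTupleAtATime(), sumVectorized(1024)) // same sums
    }

Both produce the same answer; the difference is purely in how much bookkeeping runs per row.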


Did PGOpenCL ever get released in any form? I've seen that PDF over and over again, but never any product or code.


All I can find about PGOpenCL after a quick search are slides. Is there anything more?


I'm not sure that I understand the big picture here. Is my summary correct?

This optimization does not apply to Postgres in general, but only to cstore_fdw and other column-store foreign data wrappers.

The Postgres execution engine was designed to operate on a row store. Even so, it can operate on a column store via a foreign data wrapper like cstore_fdw. However, to do so it must currently reconstruct rows one at a time to fit its row-oriented nature. As a result, it cannot currently realize the potential performance benefits of using a column store for certain queries.

This article is about an extension that adds some new column-at-a-time aggregate functions, then hooks into the Postgres executor to modify eligible query plans to use them. It thus enables Postgres to take advantage of the column-oriented nature of cstore_fdw.

Is that an accurate way to look at it?


Yes, it is. Can started by trying these ideas on Postgres' row store, and he thought we could get good benefits, particularly when the columns in a table had fixed widths.

That said, making changes for both fast column projections and aggregations in Postgres didn't look too easy. Since we told him to go crazy in a short amount of time, the best way to do that seemed to be through the columnar store.


Yes.

Also, buried inside is a reference to the original VectorWise work this is based on: "Vectorization vs. Compilation in Query Execution". They don't even mention the name. Not nice IMHO.


Impressive work for any engineer, but an intern!? Amazing.


They implemented ideas from a VectorWise paper. It's very good for an intern, but it's not original work.


(Ozgun from Citus Data here.)

I updated the Readme and added a second link to the original vectorization paper. We found that most of the benefits came from simply switching to batch processing; that's why we referenced the more recent "vectorization vs. LLVM" paper.

I'd say the original part of Can's work is showing that these ideas can be applied to PostgreSQL's robust and optimized executor without touching any core logic. In that sense, it could also be relevant to this earlier thread on cstore: https://news.ycombinator.com/item?id=7524886


Yeah, I remember being downvoted without explanation.


For clarification, it wasn't anyone from our team who downvoted any of these threads.

All we can do on our part is to keep working on PostgreSQL to make it even better and faster, and to share our work with others in the community.


This was my feeling, as well.


Interns are great. Some of them are PhD candidates or even post-docs. Just because they are looking for short-term employment doesn't make them underqualified. They also lack the responsibilities that are heaped on full-time engineers, like managing or mentoring junior engineers (and interns), attending meetings, etc. They can focus on their projects.


Also, it's easier for them to get more speculative work because the cost of failure is lower.


@OP: A really good read and a great performance find. I wouldn't have expected those kinds of gains, but then I hadn't considered the impact of CPU-level optimizations. The improvements are fantastic! Could you speak a bit about how the vectorized approach benefits from CPU features? Does it work better with the cache?

Aside: This is a good concept to keep in mind when using channels in Go as well. When there is overhead to moving a single unit of work through a process, and you're doing MANY of these shipments, there are savings to be had from making your unit of work larger (a vector/slice of items vs. a single item), as in the sketch below.
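
For instance, a hypothetical sketch (names invented) of the single-item vs. batched channel pattern:

    package main

    import "fmt"

    // producerSingle pays one channel synchronization per item.
    func producerSingle(ch chan<- int, n int) {
        for i := 0; i < n; i++ {
            ch <- i
        }
        close(ch)
    }

    // producerBatched sends slices, amortizing the synchronization
    // cost over batchSize items.
    func producerBatched(ch chan<- []int, n, batchSize int) {
        buf := make([]int, 0, batchSize)
        for i := 0; i < n; i++ {
            buf = append(buf, i)
            if len(buf) == batchSize {
                ch <- buf
                buf = make([]int, 0, batchSize)
            }
        }
        if len(buf) > 0 {
            ch <- buf
        }
        close(ch)
    }

    func main() {
        single := make(chan int, 4)
        go producerSingle(single, 10_000)
        totalSingle := 0
        for v := range single {
            totalSingle += v
        }

        batched := make(chan []int, 4)
        go producerBatched(batched, 10_000, 256)
        totalBatched := 0
        for batch := range batched {
            for _, v := range batch {
                totalBatched += v
            }
        }
        fmt.Println(totalSingle, totalBatched) // identical totals
    }

Both consumers compute the same total, but the batched version performs orders of magnitude fewer channel synchronizations.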


Vectorization shifts the performance tradeoffs in the storage model. As a practical matter, users design their applications around the performance tradeoffs of database engines, so you really can't change your storage model without alienating users.

There are roughly four different storage models in databases, each with its own acronym: NSM (pure row-structured storage), DSM (sorted column-structured storage), PAX (like DSM, but where a row is stored within a single page), and VSM (individual columns stored as arrays, where every element of a row has the same index). PAX and VSM are hybrid storage models. All four are used in real commercial systems. Vector storage models (VSM) are extremely good for real-time workloads; they sacrifice a little insert performance for a major improvement in query performance relative to NSM, but their insert/append performance is much better than DSM's or (to a lesser extent) PAX's.
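
In Go terms (type names invented for illustration), NSM is roughly a slice of structs, while a VSM-style layout is a struct of slices, where row i is the i-th element of every column array:

    package main

    import "fmt"

    // NSM (row store): each row's attributes are contiguous in memory.
    type RowNSM struct {
        ID    int64
        Price float64
        Qty   int32
    }

    type TableNSM []RowNSM

    // VSM-style (column store): each attribute is its own contiguous
    // array; row i is the i-th element of every column.
    type TableVSM struct {
        ID    []int64
        Price []float64
        Qty   []int32
    }

    func main() {
        nsm := TableNSM{{1, 9.99, 3}, {2, 4.50, 1}}
        vsm := TableVSM{
            ID:    []int64{1, 2},
            Price: []float64{9.99, 4.50},
            Qty:   []int32{3, 1},
        }

        // Summing one column touches every row's bytes in NSM, but
        // only one small, dense array in the columnar layout.
        var sumNSM, sumVSM float64
        for _, r := range nsm {
            sumNSM += r.Price
        }
        for _, p := range vsm.Price {
            sumVSM += p
        }
        fmt.Println(sumNSM, sumVSM)
    }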

The reasons vector models are so efficient for query processing boil down to a few facts. First, a column can be streamed into the CPU as a sequential memory scan, which is very efficient. Second, constraints on multiple columns can be trivially parallelized. Third, searches across multiple columns/attributes can usually be constructed as a bitmap per column and then composed with simple AND/OR operations over those bitmaps, which is extremely fast. Fourth, individual columns in this model are amenable to vector instructions in the CPU, saving even more clock cycles.
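
A toy sketch of the third point, assuming a batch of at most 64 rows so one machine word can hold each selection bitmap (column names invented):

    package main

    import "fmt"

    func main() {
        // One small batch of two columns.
        price := []float64{9.99, 4.50, 12.00, 3.25}
        qty := []int32{3, 1, 7, 2}

        // Evaluate each predicate over its whole column, producing a
        // per-column selection bitmap in a tight loop.
        var bmPrice, bmQty uint64
        for i, p := range price {
            if p < 10.0 {
                bmPrice |= 1 << uint(i)
            }
        }
        for i, q := range qty {
            if q >= 2 {
                bmQty |= 1 << uint(i)
            }
        }

        // Compose the predicates: one word-wide AND covers 64 rows.
        matched := bmPrice & bmQty
        for i := range price {
            if matched&(1<<uint(i)) != 0 {
                fmt.Println("row", i, "matches") // rows 0 and 3
            }
        }
    }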

There is no ideal storage model; it really depends on the workload. Some sophisticated databases will change the storage model to adapt to the likely access patterns, such as NSM (great for a page with a lot of inserts) to PAX or VSM (great for queries, when a page is unlikely to change much).


The cstore_fdw extension uses the Optimized Row Columnar (ORC) format, which seems to be PAX in your nomenclature. The improvements he accomplished seem to be related just to the first "fact" you listed, which applies to all column stores.


I am not familiar with that particular implementation. PAX is not my nomenclature; it is what sorted in-page columnar formats are called in the literature. (It is what the first paper describing them called them.)

Not all columnar formats have great CPU streaming performance, and even then it is data-type dependent. Many columnar formats have significantly worse streaming performance than vector formats, because only the latter are primarily optimized for it, and column stores will trade that away for other things (like format compressibility). And for data types that have no mathematical order of any kind, e.g. interval data types, columnar formats effectively degenerate into very inefficient vector formats. It is still an open academic discussion.


Take a look at MonetDB, an open source column store database [1]. There was a commercialised extension to that called X100 targeting cache efficiency [2]. That became VectorWise [3].

[1] http://www.monetdb.org

[2] http://www.cidrdb.org/cidr2005/papers/P19.pdf

[3] http://en.wikipedia.org/wiki/Vectorwise


Great article. I am not a real programmer (but a regular user of Postgres for scientific data storage) and do not get every detail. But do these changes have a chance of being implemented in a future version of Postgres, or are these improvements tied to CitusDB?


I can't imagine this would ever get into core Postgres. It's a very specific extension.


So what's the likelihood that we'll see this optimization make it into PostgreSQL core?

That would be an amazing performance improvement.


Wonder if these improvements work in tandem with the in-memory columnar store (IMCS) contributed by Konstantin Knizhnik.


IMCS [1] already uses SIMD vector instructions.

[1] http://www.pgcon.org/2014/schedule/attachments/322_IMCS.pdf


Well done, Can! Keep up the good work.


Agreed, that's a great report. Your results sound encouraging; this definitely seems like a good place to find traction.


Very nice.


Cool post. I don't think this is a Show HN though.


Are you kidding? Articles like this are my favourite posts on HN. I would love to see more technical content like this and less softcore chit-chat.



