Hacker News new | past | comments | ask | show | jobs | submit login
PostgreSQL Columnar Store for Analytic Workloads (citusdata.com)
190 points by tadruj on April 3, 2014 | hide | past | favorite | 57 comments

I've never really digged into column-oriented storage, so had a quick skim... Would the below excerpts/example be a fair note of pros/cons of the general idea?

> Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.

Example: SELECT sum(a) FROM things;

> Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.

Example: UPDATE things SET a = a+1;

> Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.

Example: SELECT * FROM things;

> Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.

Example: INSERT INTO things (a,b,c,d,e,f,g) VALUES (1,2,3,4,5,6,7);

That's pretty accurate. Column-stores also make data compression significantly more effective because they are storing many values of the same type together [1].

The improvements to your first two points are typically 1-2 magnitudes faster with column-stores [2].

[1] http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf

[2] http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

That sounds like a pretty good summary.

From a workload perspective, row-stores are predominant in the OLTP (transactional insert/update) domain. They are also used in OLAP and data warehousing. Still, column stores have benefits when the underlying tables have many columns and the user's running analytic queries over a small subset of those columns.

Also column-oriented db are much more efficient in filtering rows, specially for very large datasets with complex filtering expressions. This is what makes them good at data analysis.

I consider different from your first point because sometimes is not the aggregation, but the drill down what is needed

Good work, and I'm impressed by the PostgreSQL foreign data wrapped API power to allow for such ease of implementation.

The .proto defined for cstore_fwd differs from the ORC as defined for Hive. At a quick glance I can't find references to dictionary encoding nor statistics, and the datatypes used are apparently the native PostgreSQL data types. From what I can tell this implementation leverages reduced IO (fetch only columns of interest from disk), segment elimination (use min/max info to skip over entire row groups) and pgzl compression for the data stream. I couldn't find references to run-length encoding (RLE) or dictionary encoding. I'm sure the shortcoming will be improved in future iterations, specially better encoding schemes which would result in better compression.

But I'm a bit disappointed that the ORC format used is not the same as the one originally used in Hive, Pig and the rest of Java/Hadoop ecosystem. Had it shared the actual file format it would had enabled a number of very interesting scenarios. Think about Hive/PIG serving as ETL pipelines to produce ORC file that are attached directly to PostgreSQL (add today's partition as FOREIGN DATA WRAPPER ... OPTIONS (filename '.../4_4_2014.cstore') and then do interactive analytic driven by PostgreSQL. It would reduce the cost of ingress significantly (fast attach, no need for expensive INSERT and re-compression). I realize data type translation would had been thorny, to say the least (datetime and decimal as primitives, probably many of the structured types too).

- github.com/citusdata/cstore_fdw/blob/master/cstore.proto - github.com/apache/hive/blob/trunk/ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto

I think the emphasis should be on the fact that Postgres now has a free and open source columnar store. For many years (probably decades) there have been companies that have developed "big data"/analytics system with Postgres as the base, but have not contributed back to the Postgres ecosystem.

While this new columnar store is not as speedy as ones that have been around since the early 2000s, it does give a platform for CitusData and other companies to build on and share solutions.

Having a DB that can hold both the transactional data and data for fast analytical purposes is very advantageous as you have less moving parts and much less ETL work.

What I am looking forward to now is a few start ups similar to CitusData that solve different "Big Data" problems and for them to work together to disrupt the multi-billion dollar datawarehouse/analytics vendors.

This is potentially very interesting, it's the only open source SQL columnar store I'm aware of. Bonus points for being part of Postgres.

Basically ParAccel (like used by Amazon RedShift) at reasonable cost.

InfinyDB did this with MySQL a long time ago http://infinidb.co/products

There is MonetDB: https://www.monetdb.org/

I use it with great success.

And it's been around forever. If you're interested in high-speed processing, there's lots of good papers from that project (e.g. how to avoid creating branches, stuff like that).

I agree, I'm excited. Hopefully the optimizer can take full advantage of these column stores. However, postgresql isn't the only open source column store. MonetDB is another one. It is an academic project but people do use it commercially. I know some people who get great performance out of MonetDB, even though its optimizer has some gaps.

https://www.infobright.org/ has been around for ages.

Virtioso 7 by openlinksw is a GPL columnar sql (and sparql) RDBMS

Don't you still need to purchase citus DB to use this?

Nope, you don't need to purchase CitusDB.

We open sourced cstore_fdw with the Apache license. You can just build and load the extension, and start using it with your PostgreSQL 9.3+ database.

If you later need to scale out your PostgreSQL cluster, just please let us know. :)

I know some statisticians / analyst minded people your generous contribution may make extremely happy. Myself as well of course.

Thank you very very much in advance. This is a really huge deal. Even for general things like academic research. We shall see.

Very cool! Thanks for contributing this.

Any particular reason why ORC was chosen as the columnar store format over Parquet (https://github.com/Parquet/parquet-format)? Reason I ask is because Parquet seems to have its own development cycle, roadmap, and is pretty continuously updated with enhancements.

I'd be interested to see more benchmarks. The improvements in this post are not anywhere close to what we've seen going from PG to Infobright for our reporting queries - we get speedups from 10x - 1000x, the one speed benchmark they have here is only 2x.

(Ozgun from Citus Data here.)

We're going to publish more comprehensive results in a few weeks. In our initial tests, we found the speed up to depend a lot on the underlying data and the type of queries.

For example, we found that when compression reduced the working set from being on disk to in-memory, there was a significant jump.

Also, we're looking to do optimizations on the cost estimation side -- these will notably help with queries that join multiple tables together, a common scenario for the TPC-H benchmark.

Awesome, can't wait to see more benchmarks. I would love to be able to switch back to 100% Postgres. IB gives us tons of speed but even ignoring the MySQL warts it just feels incomplete. And we ran into a bug in production where the order of AND was actually having an effect on query results, something that's completely unacceptable and makes me worry about my data.

Doesn't look like it uses any kind of vectorized execution. So the only advantage one gets is reduced amount of data to process. It also doesn't mention execution on compressed data so it has to decompress the data which doesn't come cheap.

A relatively easy way to change that would be to use Intel's ISPC compiler.


Because they still use PostgreSQL engine, very slow.

There is a great talk by Stonebraker about this. (He created PostgreSQL about 3 decades ago and moved on to newer database engines).

He started Postgres before it became PostgreSQL (i.e. before it even supported SQL). It's quite a different system now.

The PostgreSQL Stonebraker left was a very different one than the current. It did not even have MVCC.

Is this talk online?

How is this different than the many other columnar SQL databases and extensions?

Columnar querying is typical for OLAP, PostgreSQL engine is aimed at OLTP. This doesn't look like a good idea. Like adding side floats to a car and paddles to use it like a boat.

This goes against using the right tool for the right job.

How is 'jsonb' better than mongodb ? Because you can use the same tool, you'll support the same db, you'll pay 0$ for licensing, you may pay 0$ for sharding (postgresql-xc) etc.

Downvotes? Lovely.

Also, I bet VoltDB, a modern open source OLTP, can beat this thing hands down. Also in-memory and clusters. Complex store procedures precompiled and many other goodies.

Commercial column stores like Vertica should be orders of magnitude faster.

Don't know why you are being down voted, but you should understand its about tradeoffs.

If I want to increase my database performance I can either

1.) Build this plugin and integrate it into my already working ecosystem 2.) Spend time researching, testing, and deploying VoltDB.

Given the popularity of postgres, and the relatively low friction solution of (1.), its clear why this could be an adequate solution. Sure you won't be as fast as VoltDB, but as an outside engineer we don't know the potential customers requirements, and if being as fast as VoltDB actually matters.

This is really exciting! Columnar storage is something that the big boys like Microsoft and Oracle charge an arm and a leg for. You can currently get column-based SQL open-source databases, but this new FDW allows you to mix workloads on the same machine.

I am curious about how this thing compares to something like Amazon Redshift.

Briefly skimming, it looks pretty similar, except for the data compression part, which uses RCFile. It also supports more data types. If this being adapted by redshift or something else, I will be thrilled.

So with Cassandra you have a pretty nice, scalable, columnar DB with a SQL interface[0]. Not to mention, it's free and apache licensed so you can distribute it as part of your own software. I guess I've only looked at cassandra from the scope of a developer. Would a DBA prefer using a columnar version of PostgreSQL rather than using cassandra for free?


Edit: I didn't realize Citus Data was making the columnar postgres offering open source...that's great!

Cassandra does not have a SQL interface. CQL (Cassandra Query Language) provides none of the aggregation, grouping and dearth of other features an analyst would depend on from a SQL-like database.

Not to say there aren't tools to get SQL ontop of Cassandra (Datastax has a hadoop/mr driver, that you can probably put pig/hive/presto(?) on top of).

As pointed out by another commenter, Cassandra is not a column store. Its storage format is quite inefficient for compression and for large scans of subsets of columns, which is what columnar stores are designed for (Cassandra and HBase optimize for different use cases, which are much closer to OLTP).

You can find a brief description of why and when you want actual columnar formats in the introduction section of this blog post: https://blog.twitter.com/2013/dremel-made-simple-with-parque... .

You can read more about columnar databases here: https://en.wikipedia.org/wiki/Column-oriented_DBMS

A columnar storage format for a database engine that does not have operations specifically tuned for vectorization and taking advantage of data being cheaply available a column chunk at a time instead of a row chunk at a time is only half the battle -- you can get massive perf wins on top of the work CitusDB did by doing things like operating on compressed data / delaying decompression, avoiding deserialization for column values based on a filtering scan of another column, pipelining operations on column values to fit better into your cpu architecture, etc.

But just having the IO wins is nice, too.

Cassandra is a Bigtable style column store not an analytic column store.

It's an unfortunate naming collision.

I think cassandra is pretty good for analytical workloads...for instance "Give me all the customers from pittsburgh" is a faster scan when the city column is laid out contiguously on disk.

Cassandra doesn't store columns from different rows together to my knowledge. That means it is no way like an analytic column store where columns from all rows are stored together. This is not a bad thing, it just wasn't intended to be an analytic column store and makes design choices suited for other things.

See http://wiki.apache.org/cassandra/ArchitectureSSTable

Analytic column stores can exploit storing columns separately by using various encoding schemes to compress data that are far superior to generalized compression and they can use the sort to zero on relevant ranges for each column.

Ah ok. I understand why a columnar database would be better for analytical workloads, but I didn't realize that HBase and Cassandra aren't actually laid out in a columnar fashion on disk. I was confused by the 'column store' terminology. This makes Citus data's release even more compelling.

Thanks for clearing that up!

can you kindly provide at a high level the differences between the two?

For Cassandra see http://www.datastax.com/docs/0.8/ddl/index

An analytic column store like say Vertica has a schema like a regular SQL database. I don't know what their flexible schema story is right now.

Instead of storing the columns of a row together an analytic column store will store columns from many rows together in sorted runs. When you go to do a scan your disk will only read the columns you have selected. The format for column storage is optimized for specific types and uses type specific compression so 10-50x is something that is claimed. This further improves the IO situation. They can also zero in on relevant ranges of data for each column because they are indexed and this further reduces the IO requirements.

Where other databases are bound on seeks or sequential throughput an analytic column store will be bound on CPU, especially CPU for the non-parallel portions of every query.

Obviously a column store will have a hard time selecting individual rows because the data is not stored together so it will be expensive to materialize. They also have trouble with updates/deletes to already inserted data, in some cases requiring the data be reloaded because updates have dragged everything down.

>We are excited to open source our columnar store extension for PostgreSQL and share it with the community


I have written a short blog post on how CitusDB's column store compares with purpose-built systems such as MonetDB. Repeatability ftw. Disclosure: I am part of the MonetDB development team.

CitusDB vs. MonetDB TPC-H Shootout http://www.monetdb.org/content/citusdb-postgresql-column-sto...

I am curious, why are foreign tables necessary?

(Ozgun from Citus Data here.)

Foreign tables provide a nice abstraction to separate the storage layer from the rest of the database. They are similar to MySQL storage engines in a sense. With them, users don't need to make any changes to their existing PostgreSQL databases, and instead just Create Extension and get rolling.

You can also use both foreign and regular tables in the same query: http://citusdata.github.io/cstore_fdw/#toc_4

My question is what do you lose by doing so? FDW has always felt like a little bit of a hack, but I guess if the optimizer and execution engines can use it as if they were native tables, then there isn't much lost. But the fact that ProtoBuf is used makes me think that there is some overhead that doesn't occur in native tables.

I could be a bit off here, but from what I remember about working with FDWs I think the biggest drawback is that you can't assume native access to system tables on a remote server, which means you can't do things like CREATE INDEX on a foreign table and aren't going to have, e.g., statistics about those tables. However, it does have pretty much all the information the optimizer does about the actual query, and if you're implementing your own storage engine like Citus rather than trying to hook into someone else's you can probably get around those problems.

You're correct that foreign tables do not allow explicit index creation (though cstore_fdw does use skip indexes internally), but ANALYZE can ask them for statistics: http://www.postgresql.org/docs/current/static/sql-analyze.ht...

cstore_fdw supports ANALYZE so these statistics can indeed be used by the planner; however, the autovacuum daemon doesn't do this automatically for foreign tables, so it's up to the user to decide how often to run ANALYZE on them.

That's pretty cool, I had no idea that was supported :)

Makes sense, thanks for the explanation. Mixing table types is a pretty nice feature.

Likely because you have to run this as a separate db server. The speed improvements come from adjusting data locality on disk so they're probably not able to have columnar & regular data stores in the same PG instance/data files.

Similar to how Infobright is just MySQL but you still can't mix columnar and regular DBs in the same instance, you have to run IB separately.

(I'm Jason from Citus Data)

As Ozgun notes above, the Foreign Data Wrapper APIs just provide a powerful abstraction for what this extension is doing.

Though the DDL to set up a foreign table requires first creating a "server" with CREATE SERVER, this is merely a formality: as in file_fdw the server doesn't actually represent an external process. All I/O occurs within the postgres backend process servicing the query.

This sounds very interesting. Are there any pre-built Windows binaries available for this extension?

You'd have to mangage backups of the storage file separately than backups of postgresql, right?

Any chance it will support INSERT some day?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact