Hacker News new | past | comments | ask | show | jobs | submit login
PostgreSQL columnar store benchmarks on SSDs (citusdata.com)
120 points by il on June 4, 2014 | hide | past | favorite | 36 comments

This looks so good, that a question arises: Where's the catch? In other words, in which situations is a columnar DB a bad solution?

Its a bad solution for OLTP. C-stores are read optimized while a typical (row-oriented) rdbms is write optimized. Read optimized storage is typically accomplished via a write buffer that is flushed to a read store at a certain epoch.

Basically, if you need to be able to immediately read the data you just wrote, a cstore is a bad choice.

There's also in betweens. Cassandra for example is a bit of a hybrid. It writes quite well even though it has some columnar / big-table aspects.

Ben from CitusData here.

At a high level columnar stores are good for analytical workloads and not great for OLTP workloads. Some reasons for this:

* cstore_fdw currently only supports modifying data via the COPY command. This means that UPDATE and DELETE operations aren't yet possible which limits the utility for OLTP workloads.

* The skiplist indexing system works well for analytical workloads where you are aggregating or retrieving a significant number of rows but wouldn't be great for retrieving single rows.

Great, thanks. We have some OLAP datamarts in our app, I try your solution soon.

We use Vertica for our analytics, and basically, INSERT, UPDATE and DELETE are performance risks, as are trying to SELECT large amounts of entities.

As other commenters have said, Vertica, like other columnar DBs, is optimised for reads, and it does so basically through materialized views (called projections) across multiple nodes - so when underlying data is updated, those updates have to be replicated across the system. It's not a huge concern in a reporting / analysis environment though.

Retrieving whole entities is not particularly performant, but if you want to group and aggregate columns, it's blistering fast.

I ran some numbers - our largest stats table in Postgres is 64GB, 379 million rows, and summing a column can take up to 4.5 minutes. Our largest table in Vertica is 500GB, 2.7 billion rows, and summing a column can take up to 100 milliseconds.

I'd have to say that the worst thing about Vertica is that it's not free as in beer or as in speech. Licencing is done on a raw data basis, and the high costs of upgrading licences often drive our technical decisions. And likewise, trying to use the Vertica JDBC library to stream entities from Postgres to Vertica was an interesting exercise (in fairness to Vertica, it was caused by poor implementation of some code in the Postgres JDBC library), because I had only the compiled JARs to work with, and so debugging was difficult.

Other catches - constraints are not enforced - a performance decision. The other issue is that projection design is a very fine art and really determines how much performance you'll get. That said, Vertica ships with a projection designer that designs projections to answer specified queries fast, and the designer is suitable for about 98% of your needs. For the last 2%, you can often find yourself trying to interpret the mind of the query planner when you want it to use your new projection.

...This leads to another catch - as projection structure is strongly tied to query structure, queries that are changed and ad-hoc queries may suffer performance issues, and they do so silently. Vertica does make excellent records of query metrics, but nothing will blow up if an existing projection can no longer answer a query.

The last caveat is that importing data into Vertica requires some thought, and how you do it is typically determined by Vertica's design constraints. The documentation is good, and the training is worthwhile, but there's still a lot of thinking specific to the implementation required.

"basically through materialized views... summing a column can take up to 100 milliseconds"

That's a key point worth expanding on. This isn't so much a columnar/row distinction as it is a point about materializing results (or partial results) in a clever way that allows for fast processing for those kinds of queries.

That moves the problem: queries are faster (as long as they are in the sweet spot of queries that you expected), but you have extra cost at insert/update/delete time and extra effort when designing the "projections" (as you say).

Columnar makes a similar trade in that writes are slower and a class of "typical" grouping and aggregation queries are generally faster.

By combining the two strategies (materialization and columnar), Vertica is essentially doubling-down on that trade. So queries that match what you designed the projections for, and that follow typical patterns of grouping and aggregation, are blazingly fast (or so I've heard). But it's more restrictive about the kinds of queries that will work well, and it's sensitive to new kinds of queries that you might not have planned for when designing the projections (again: or so I've heard). In other words, it's more specialized.

Personally, I don't find ultra-specialized databases very interesting (speaking generally here, not necessarily saying that Vertica is ultra-specialized). I basically see two reasons to use something specialized: (1) replicate data into it and run queries (i.e. an accelerator); and (2) put lots of low value data in it, that you'd otherwise throw away, and try to extract some signal from it. (There are a couple other cases, like real-time systems, but that is outside of what I ordinarily deal with.)

Disclaimer: I work for a big data company that is not Vertica. Our products offer row or columnar as storage options, but that is not central to the products' value propositions. Opinions above are my own.

> Personally, I don't find ultra-specialized databases very interesting (speaking generally here, not necessarily saying that Vertica is ultra-specialized). I basically see two reasons to use something specialized: (1) replicate data into it and run queries (i.e. an accelerator); and (2) put lots of low value data in it, that you'd otherwise throw away, and try to extract some signal from it. (There are a couple other cases, like real-time systems, but that is outside of what I ordinarily deal with.)

Pretty much. We had a specific business requirement of near real-time statistics and historical statistics available at an entity / day level of granularity for advertisers and publishers using our platform.

Vertica occupies a very sweet spot for us in that regard, but it requires a lot of specialised approaches. I've managed to break its tuple mover a couple of times when loading historical raw data in large quantities.

Thanks for the really detailed explanation!

Have you tried monetdb (since your largest table is 500GB/1node?)?

I haven't, but I'll definitely look into it, cheers. :) Our largest table is only that size due to licensing issues, mind.

Anything non-OLAP. Using a columnar store for your transactional master, or as a document database would probably be disastrously bad, for example.

I wonder how these benchmarks would compare with citusdb + compression. Very interesting nonetheless.

Any comparisons against Vertica or other DBs that were designed to be columnar from the ground up?

Published benchmarks comparing commercial databases are difficult to come by due to the DeWitt Clause [1] in many commercial database licenses.

[1] http://en.wikipedia.org/wiki/David_DeWitt

If you are not familiar with Foreign Data Wrappers, they allow you to connect to other datastores and represent that data as tables in your database. http://wiki.postgresql.org/wiki/Foreign_data_wrappers

Does this support JOINs? Or do you use a giant WHERE IN () clause?

My use case is essentially a cross-database JOIN that I've been using MySQL & temp tables to accomplish. For example, give me the sum of column x if column y is any one of these 50,000 values from a separate system. So load the 50,000 values into a temp table and then do a JOIN. Performance isn't that great and it uses a ton of disk space so I wanted to try using a columnar store.

Seems like you might want to consider getting all of your data in the same place. A Foreign Data Wrapper feeding a Materialized View might be a neat way to solve your problem: http://www.postgresql.org/docs/9.3/interactive/rules-materia...

> Seems like you might want to consider getting all of your data in the same place

Unfortunately that's not possible

I'm very excited about this! Add a mechanism to distribute data and queries across a cluster, and this could be the makings of an open-source Amazon Redshift.

Actually CitusDB, which we also produce, does distribute data and queries across a PostgreSQL cluster. You can see some performance comparisons of using cstore_fdw on CitusDB in the blog post and the performance improvements are even more significant there.

It would be interesting to compare these benchmarks against the performance of Amazon's Redshift.

If the benchmark can be run without changes on Redshift would be my first question. There are some interesting differences that Redshift has rather than just being a columnar PostgreSQL protocol-speaking database. But if its possible, I'd be very interested to see the results.

Do the benchmarks for postgres utilize in memory columar store (IMCS)? What is the difference between postgres imcs and citus cstore_fdw? http://www.postgresql.org/message-id/52C59858.9090500@garret...

The benchmarks use cstore_fdw which is a columnar store that is accessed via the PostgreSQL foreign data wrapper system.

ICMS is a different implementation of a columnar store for PostgreSQL. I don't know it extremely well but my understanding is that it uses the PostgreSQL shared memory system to allocate memory for a columnar store. This columnar store is not queried via standard SQL commands but rather using a combination of a custom query language and some user defined functions.

Isn't the assumed tradeoff SSD storage for CPU usage? How much more cpu time is utilized in compressing/decompressing? And whats the unit cost of that extra CPU in comparison to the cost for disk space savings of 'expensive' SSD's?

SSDs are usually deployed in order to reduce (basically eliminate) random seek time, when compared to a platter hard drive. Ironically, this tends to make less of an impact in a column store, because a column store optimizes data arrangement on disk for reading large blocks of contiguous data, which platter drives are quite good at.

We tested SSD vs. HDD performance effects on our $COMMERCIAL_COLUMN_STORE here at work. SSDs were not worth the money.

CPU time is so ridiculously cheap compared to disk IO — even on SSD — that it's pretty much always a win, sometimes massively so.

My comment during a previous instance of this discussion:


I think you're comparing cheap in terms of time, Im talking about in terms of resource utilization and saturation. I think it would be clearer to show what the exact tradeoff is in terms of cost. In an arbitrary scenario where you have 100$ worth of ssd, and 100$ worth of CPU, and this strategy saves 75% of the SSD space at the cost of 1% more utilization of CPU, then yeah, its a total win. But isnt your example showing that it takes 5 times as much cpu time compressed vs uncompressed?

Just using my own example, it's roughly a 5x increase in CPU time for a > 10x reduction in disk space (and commensurate reduction in disk IO — including writes, which is an important consideration when you're using SSD). Let's just say 10x though, for sake of easier math.

Assume $1/gbyte storage costs. We're not talking about consumer-grade drives here, right? It's probably worse than that anyway, because you're using some form of RAID, too. Aren't you?

Assume further a $1000 8-core CPU, as that's what the machine in my example uses (it's actually slightly more expensive, but again: easier math), and a 3 year depreciation schedule. That's roughly 1000 days, or $1/day, $0.125/core-day, or $0.000087/core-second.

Storing that log file uncompressed costs you $8.817 in storage, and $0.00085 in CPU time, if the process uses 100% of a core for the duration. Storing it compressed costs you $0.828 in disk space and $0.0038 in CPU time — again, if you're burning an entire core for all 44s.

I think $0.8318 is less than $8.81785, but feel free to check my math...

EDIT: Yes, I know I'm conflating the ongoing cost of storage and the incidental cost of CPU time. I'm also ignoring the cost of power and cooling, leaving entirely aside the difference between storage milliseconds and CPU nanoseconds, and, and, and. I guess, if it's not obvious that compression is pretty much unequivocally a win with modern CPUs, then I don't know what else to say.

That math stands only if your server has spare CPU utilization.

I disagree. Being IO starved is far more likely, and has a vastly greater impact on the utility and capability of a host, than being CPU starved.

Sure IO starving is more likely, but on analytic DB (which is supposed use of showed tech) you probably will do some complex queries with joins, filtering, aggregation etc. That can load CPU pretty well. So you always need to check where the bottleneck is.

I couldn't find documentation about what subset of SQL you can use. I saw mention of "all supported Postgres data types", but not anything about what features work. Any links?

If you use cstore_fdw with PostgreSQL you can use the full PostgreSQL SQL dialect for querying your table. However, we don't currently support writing to your cstore_fdw table using SQL commands like INSERT/UPDATE/DELETE. PostgreSQL added foreign data wrapper support for these commands with their 9.3 release and it is something we're considering adding to cstore_fdw in the future.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact