Basically, if you need to be able to immediately read the data you just wrote, a cstore is a bad choice.
At a high level columnar stores are good for analytical workloads and not great for OLTP workloads. Some reasons for this:
* cstore_fdw currently only supports modifying data via the COPY command. This means that UPDATE and DELETE operations aren't yet possible which limits the utility for OLTP workloads.
* The skiplist indexing system works well for analytical workloads where you are aggregating or retrieving a significant number of rows but wouldn't be great for retrieving single rows.
As other commenters have said, Vertica, like other columnar DBs, is optimised for reads, and it does so basically through materialized views (called projections) across multiple nodes - so when underlying data is updated, those updates have to be replicated across the system. It's not a huge concern in a reporting / analysis environment though.
Retrieving whole entities is not particularly performant, but if you want to group and aggregate columns, it's blistering fast.
I ran some numbers - our largest stats table in Postgres is 64GB, 379 million rows, and summing a column can take up to 4.5 minutes. Our largest table in Vertica is 500GB, 2.7 billion rows, and summing a column can take up to 100 milliseconds.
I'd have to say that the worst thing about Vertica is that it's not free as in beer or as in speech. Licencing is done on a raw data basis, and the high costs of upgrading licences often drive our technical decisions. And likewise, trying to use the Vertica JDBC library to stream entities from Postgres to Vertica was an interesting exercise (in fairness to Vertica, it was caused by poor implementation of some code in the Postgres JDBC library), because I had only the compiled JARs to work with, and so debugging was difficult.
Other catches - constraints are not enforced - a performance decision. The other issue is that projection design is a very fine art and really determines how much performance you'll get. That said, Vertica ships with a projection designer that designs projections to answer specified queries fast, and the designer is suitable for about 98% of your needs. For the last 2%, you can often find yourself trying to interpret the mind of the query planner when you want it to use your new projection.
...This leads to another catch - as projection structure is strongly tied to query structure, queries that are changed and ad-hoc queries may suffer performance issues, and they do so silently. Vertica does make excellent records of query metrics, but nothing will blow up if an existing projection can no longer answer a query.
The last caveat is that importing data into Vertica requires some thought, and how you do it is typically determined by Vertica's design constraints. The documentation is good, and the training is worthwhile, but there's still a lot of thinking specific to the implementation required.
That's a key point worth expanding on. This isn't so much a columnar/row distinction as it is a point about materializing results (or partial results) in a clever way that allows for fast processing for those kinds of queries.
That moves the problem: queries are faster (as long as they are in the sweet spot of queries that you expected), but you have extra cost at insert/update/delete time and extra effort when designing the "projections" (as you say).
Columnar makes a similar trade in that writes are slower and a class of "typical" grouping and aggregation queries are generally faster.
By combining the two strategies (materialization and columnar), Vertica is essentially doubling-down on that trade. So queries that match what you designed the projections for, and that follow typical patterns of grouping and aggregation, are blazingly fast (or so I've heard). But it's more restrictive about the kinds of queries that will work well, and it's sensitive to new kinds of queries that you might not have planned for when designing the projections (again: or so I've heard). In other words, it's more specialized.
Personally, I don't find ultra-specialized databases very interesting (speaking generally here, not necessarily saying that Vertica is ultra-specialized). I basically see two reasons to use something specialized: (1) replicate data into it and run queries (i.e. an accelerator); and (2) put lots of low value data in it, that you'd otherwise throw away, and try to extract some signal from it. (There are a couple other cases, like real-time systems, but that is outside of what I ordinarily deal with.)
Disclaimer: I work for a big data company that is not Vertica. Our products offer row or columnar as storage options, but that is not central to the products' value propositions. Opinions above are my own.
Pretty much. We had a specific business requirement of near real-time statistics and historical statistics available at an entity / day level of granularity for advertisers and publishers using our platform.
Vertica occupies a very sweet spot for us in that regard, but it requires a lot of specialised approaches. I've managed to break its tuple mover a couple of times when loading historical raw data in large quantities.
My use case is essentially a cross-database JOIN that I've been using MySQL & temp tables to accomplish. For example, give me the sum of column x if column y is any one of these 50,000 values from a separate system. So load the 50,000 values into a temp table and then do a JOIN. Performance isn't that great and it uses a ton of disk space so I wanted to try using a columnar store.
Unfortunately that's not possible
If the benchmark can be run without changes on Redshift would be my first question. There are some interesting differences that Redshift has rather than just being a columnar PostgreSQL protocol-speaking database. But if its possible, I'd be very interested to see the results.
ICMS is a different implementation of a columnar store for PostgreSQL. I don't know it extremely well but my understanding is that it uses the PostgreSQL shared memory system to allocate memory for a columnar store. This columnar store is not queried via standard SQL commands but rather using a combination of a custom query language and some user defined functions.
We tested SSD vs. HDD performance effects on our $COMMERCIAL_COLUMN_STORE here at work. SSDs were not worth the money.
My comment during a previous instance of this discussion:
Assume $1/gbyte storage costs. We're not talking about consumer-grade drives here, right? It's probably worse than that anyway, because you're using some form of RAID, too. Aren't you?
Assume further a $1000 8-core CPU, as that's what the machine in my example uses (it's actually slightly more expensive, but again: easier math), and a 3 year depreciation schedule. That's roughly 1000 days, or $1/day, $0.125/core-day, or $0.000087/core-second.
Storing that log file uncompressed costs you $8.817 in storage, and $0.00085 in CPU time, if the process uses 100% of a core for the duration. Storing it compressed costs you $0.828 in disk space and $0.0038 in CPU time — again, if you're burning an entire core for all 44s.
I think $0.8318 is less than $8.81785, but feel free to check my math...
EDIT: Yes, I know I'm conflating the ongoing cost of storage and the incidental cost of CPU time. I'm also ignoring the cost of power and cooling, leaving entirely aside the difference between storage milliseconds and CPU nanoseconds, and, and, and. I guess, if it's not obvious that compression is pretty much unequivocally a win with modern CPUs, then I don't know what else to say.