Columnar Store for Analytics with Postgres (github.com/citusdata)
39 points by rdegges on Oct 12, 2017 | 10 comments



Have a look at ClickHouse for OLAP workloads:

https://clickhouse.yandex/

It's a recently open-sourced database by Yandex that powers their web analytics backend. CloudFlare is already using it for their DNS analytics.


Do you know if it is an in-memory solution or can it work with data that doesn't fit into RAM?


It can, in fact it's specifically designed for data that doesn't fit in RAM. Their storage engine optimizes data locality for fast retrieval from non-flash storage.

See: https://clickhouse.yandex/presentations/data_at_scale/

Their docs also talk about it a lot.


It's a disk-based database, so it's not limited by RAM.


Have a look at the page. It can do both.


Has anyone achieved a dramatic speedup using this extension? Last time I tried it, I got barely a 30% speedup in a generic aggregation across a few fairly normalised tables with ~10M rows. I don't know if I'm doing something wrong or if this is all I'm getting. Thanks!


I've tried it a couple times but didn't see as dramatic a speedup as I'd hoped either. What has worked for me is storing a full array in a single row of a regular Postgres table. Then if you have to load e.g. 1 million FLOATs, you can just slurp an 8MB chunk off disk instead of looking in 1 million different places. Postgres doesn't have a lot of functions for working on that kind of data though, so I wrote an extension to provide a few at [0]. I will probably add some more functions soon around masking out values so you can work "R/Pandas-style" on these arrays.
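A minimal sketch of that layout (table and column names are just for illustration; the C aggregates in aggs_for_arrays at [0] replace the unnest trick here with something much faster):

    -- One row per series, with all values packed into a float8[] column.
    -- A million float8 values is ~8 MB read in one contiguous chunk.
    CREATE TABLE measurements (
      series_id int PRIMARY KEY,
      vals      float8[]
    );

    -- Without any extension you can still aggregate by unnesting the array:
    SELECT series_id, avg(v) AS mean
    FROM measurements, unnest(vals) AS v
    GROUP BY series_id;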

My approach breaks down if you are heavily updating your arrays though. On the mailing list Tom Lane suggested I store that kind of data outside the database.[1] But of course all software development is an attempt to have your cake and eat it too, so I wrote an extension that is a bit like cstore_fdw, but maybe more extreme: you can store/load/append a float array as an ordinary file.[2] It has similar operational disadvantages to cstore_fdw (no replication, no pg_dump, etc.) but performs faster. On the other hand it is definitely not battle-tested and still likely to change a little. (I need to improve the crash story and may add compression.) If you decide to use it, let me know!
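Roughly, the usage looks like this (a simplified sketch; the function names here are illustrative and may not match the extension exactly, so see the README at [2] for the actual API):

    -- Append a batch of readings to a file-backed float array,
    -- then load the whole array back into SQL for analysis:
    SELECT save_floatfile('sensor_1', ARRAY[1.5, 2.25]::float8[]);
    SELECT extend_floatfile('sensor_1', ARRAY[3.0, 4.5]::float8[]);
    SELECT load_floatfile('sensor_1');  -- returns a float8[]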

Oh, another limitation of keeping everything in an array (whether with floatfile or not) is that Postgres has a 1 GB limit on most RAM allocations. For example, if you run array_agg on a giant table you might get an error like `invalid memory alloc request size 1073741824`. I don't think cstore_fdw has that problem.
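For illustration (table name made up), that failure mode looks like:

    -- With enough rows, building one giant array exceeds the 1 GB
    -- per-allocation cap and fails with something like:
    --   ERROR:  invalid memory alloc request size 1073741824
    SELECT array_agg(value) FROM readings;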

[0] https://github.com/pjungwir/aggs_for_arrays/

[1] http://www.postgresql-archive.org/Performance-appending-to-a...

[2] https://github.com/pjungwir/floatfile/


Replication support was recently added to cstore_fdw: you can replicate a cstore_fdw table to another server using streaming replication.


This really depends on your use case and the type of your queries. There is a 2014 blog post with a benchmark: https://www.citusdata.com/blog/2014/06/14/columnar-store-ben...

I also believe the main use case is data compression more than raw query performance.


If you have a common filter, it is a good idea to sort your data on that field before loading it into cstore_fdw. This lets queries take advantage of skip indexes. https://github.com/citusdata/cstore_fdw#using-skip-indexes
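A quick sketch of that setup, following the cstore_fdw README (table and column names are illustrative):

    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE events_cstore (
      event_time  timestamptz,
      customer_id int,
      payload     jsonb
    ) SERVER cstore_server
      OPTIONS (compression 'pglz');

    -- Load pre-sorted on the column you usually filter by, so each
    -- stripe covers a narrow range and skip indexes can prune stripes
    -- for queries like WHERE event_time >= '2017-10-01':
    INSERT INTO events_cstore
    SELECT event_time, customer_id, payload
    FROM events_raw
    ORDER BY event_time;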



