
Columnar Store for Analytics with Postgres - rdegges
https://github.com/citusdata/cstore_fdw
======
lima
Have a look at ClickHouse for OLAP workloads:

[https://clickhouse.yandex/](https://clickhouse.yandex/)

It's a recently open sourced database by Yandex. It powers their web analytics
backend. CloudFlare is already using it for their DNS analytics.

~~~
polskibus
Do you know if it is an in-memory solution or can it work with data that
doesn't fit into RAM?

~~~
lima
Yes, in fact, it's specifically designed for data that doesn't fit in RAM.
Their storage engine optimizes data locality for fast retrieval from non-flash
storage.

See:
[https://clickhouse.yandex/presentations/data_at_scale/](https://clickhouse.yandex/presentations/data_at_scale/)

Their docs also talk about it a lot.

------
drej
Has anyone achieved a dramatic speedup using this extension? Last time I tried
it, I got barely a 30% speedup in a generic aggregation across a few fairly
normalised tables with ~10M rows. I don't know if I'm doing something wrong or
if this is all I'm getting. Thanks!

~~~
pjungwir
I've tried it a couple times but didn't see as dramatic a speedup as I'd hoped
either. What has worked for me is storing a full array in a single row of a
regular Postgres table. Then if you have to load e.g. 1 million FLOATs, you
can just slurp an 8MB chunk off disk instead of looking in 1 million different
places. Postgres doesn't have a lot of functions for working on that kind of
data though, so I wrote an extension to provide a few at [0]. I will probably
add some more functions soon around masking out values so you can work
"R/Pandas-style" on these arrays.

My approach breaks down if you are heavily updating your arrays though. On the
mailing list Tom Lane suggested I store that kind of data outside the
database.[1] But of course all software development is an attempt to have your
cake and eat it too, so I wrote an extension that is a bit like cstore_fdw,
but maybe more extreme: you can store/load/append a float array as an ordinary
file.[2] It has similar operational disadvantages to cstore_fdw (no
replication, no pg_dump, etc.) but performs faster. On the other hand it is
definitely not battle-tested and still likely to change a little. (I need to
improve the crash story and may add compression.) If you decide to use it, let
me know!

Oh another limitation of keeping everything in an array (whether with
floatfile or not) is that Postgres has a 1 GB limit on most RAM allocations.
For example you could try running array_agg on a giant table and you might get
a message about `Exception invalid memory alloc request size 1073741824`. I
don't think cstore_fdw has that problem.

[0]
[https://github.com/pjungwir/aggs_for_arrays/](https://github.com/pjungwir/aggs_for_arrays/)

[1] [http://www.postgresql-archive.org/Performance-appending-
to-a...](http://www.postgresql-archive.org/Performance-appending-to-an-array-
column-td5984739.html)

[2]
[https://github.com/pjungwir/floatfile/](https://github.com/pjungwir/floatfile/)

~~~
mtuncer
replication support is recently added to cstore_fdw. You can replicate a
cstore_fdw table to another server using streaming replication.

