Hacker News new | comments | show | ask | jobs | submit login
Columnar Store for Analytics with Postgres (github.com)
39 points by rdegges 10 days ago | hide | past | web | 10 comments | favorite

Have a look at ClickHouse for OLAP workloads:


It's a recently open sourced database by Yandex. It powers their web analytics backend. CloudFlare is already using it for their DNS analytics.

Do you know if it is an in-memory solution or can it work with data that doesn't fit into RAM?

Yes, in fact, it's specifically designed for data that doesn't fit in RAM. Their storage engine optimizes data locality for fast retrieval from non-flash storage.

See: https://clickhouse.yandex/presentations/data_at_scale/

Their docs also talk about it a lot.

Its a disk based database, so not RAM limited.

Have a look at the page. It can do both.

Has anyone achieved a dramatic speedup using this extension? Last time I tried it, I got barely a 30% speedup in a generic aggregation across a few fairly normalised tables with ~10M rows. I don't know if I'm doing something wrong or if this is all I'm getting. Thanks!

I've tried it a couple times but didn't see as dramatic a speedup as I'd hoped either. What has worked for me is storing a full array in a single row of a regular Postgres table. Then if you have to load e.g. 1 million FLOATs, you can just slurp an 8MB chunk off disk instead of looking in 1 million different places. Postgres doesn't have a lot of functions for working on that kind of data though, so I wrote an extension to provide a few at [0]. I will probably add some more functions soon around masking out values so you can work "R/Pandas-style" on these arrays.

My approach breaks down if you are heavily updating your arrays though. On the mailing list Tom Lane suggested I store that kind of data outside the database.[1] But of course all software development is an attempt to have your cake and eat it too, so I wrote an extension that is a bit like cstore_fdw, but maybe more extreme: you can store/load/append a float array as an ordinary file.[2] It has similar operational disadvantages to cstore_fdw (no replication, no pg_dump, etc.) but performs faster. On the other hand it is definitely not battle-tested and still likely to change a little. (I need to improve the crash story and may add compression.) If you decide to use it, let me know!

Oh another limitation of keeping everything in an array (whether with floatfile or not) is that Postgres has a 1 GB limit on most RAM allocations. For example you could try running array_agg on a giant table and you might get a message about `Exception invalid memory alloc request size 1073741824`. I don't think cstore_fdw has that problem.

[0] https://github.com/pjungwir/aggs_for_arrays/

[1] http://www.postgresql-archive.org/Performance-appending-to-a...

[2] https://github.com/pjungwir/floatfile/

replication support is recently added to cstore_fdw. You can replicate a cstore_fdw table to another server using streaming replication.

This actually depends on your use case and type of your queries. There is a blog post (2014) about a benchmark: https://www.citusdata.com/blog/2014/06/14/columnar-store-ben...

I also believe that the main use case is for the compression of data more than the performance benefits.

If you have a common filter, it is a good idea to sort your data on that field and load into cstore_fdw. This will help to use skip indexes. https://github.com/citusdata/cstore_fdw#using-skip-indexes

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact