

Compressing PostgreSQL JSONB data 12x using cstore_fdw - mslot
https://www.citusdata.com/blog/14-marco/156-compressing-jsonb-using-cstore-fdw

======
etrain
The chosen benchmark (a customer reviews) table is likely something that
benefits tremendously from compressed columnar storage: 1) It has a small
number of attributes which are almost always present in the records - that is,
a fixed schema. 2) The most of the fields are numeric/date, or text with low
cardinality (product_category, etc.) These things respond well to huffman
codes and run-length encoding.

This is a use case where JSON shouldn't really ever be used, because the
schema is pretty much fixed and highly regular. JSONB records essentially
carry the schema definition with them _per-record_ and in this case most of
that information is duplicated - hence the blowup in its representation on
disk.

While column stores are great for answering analytical queries that require
scans over the whole table (like the single query example they show), they
aren't as good at transactional queries (like serving webpages).

If I were citus, I'd have written the blog post using a dataset of highly
irregular JSON blobs - e.g. log messages from lots of different systems or a
big collection of web pages (serialized as json representations of the DOM).
Maybe we'll see these "in the coming weeks."

~~~
ozgune
(Ozgun from Citus Data)

Sure, we'd be interested in running more numbers. If you have example data
sets in mind, could you share them with us?

For clarification, we picked this data set for several reasons. The data set
was real, sizeable, publicly available, and it became highly referenced in
PostgreSQL's JSON/JSONB development:

[http://www.pgcon.org/2014/schedule/attachments/328_9.4json.p...](http://www.pgcon.org/2014/schedule/attachments/328_9.4json.pdf)

[http://www.pgcon.org/2014/schedule/attachments/313_xml-
hstor...](http://www.pgcon.org/2014/schedule/attachments/313_xml-hstore-
json.pdf)

[http://blog.2ndquadrant.com/jsonb-type-performance-
postgresq...](http://blog.2ndquadrant.com/jsonb-type-performance-
postgresql-9-4/)

[http://www.pgcon.org/2014/schedule/attachments/318_pgcon-201...](http://www.pgcon.org/2014/schedule/attachments/318_pgcon-2014-vodka.pdf)

~~~
ddorian43
Can you explain HOW are you compressing the json data? Ex, is it just block-
pgzip-compress? Or are you exploding each jsonb-field as a separate file like
with normal columns ?

------
mirekrusin
"Note. We currently don't support updating table using DELETE, and UPDATE
commands. We also don't support single row inserts.".

Just "COPY TO ..." or "INSERT INTO ... SELECT ..." is available it seems.

(source
[https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw))

~~~
azinman2
So that's a deal killer. Damn.

~~~
greggyb
The biggest bang for your buck with columnstore indexes is for analytical
loads. In a data warehousing situation, these are negligible restrictions.

Your use-case is not representative.

~~~
testrun
So how do you know what his/her use case is?

~~~
greggyb
I don't know what it is, I only know that this comment prompted the deal
killer comment:

>>"Note. We currently don't support updating table using DELETE, and UPDATE
commands. We also don't support single row inserts.". Just "COPY TO ..." or
"INSERT INTO ... SELECT ..." is available it seems.

>So that's a deal killer. Damn.

This limitation is not a dealbreaker at all in a data warehousing environment,
as I explained. Thus, I can assume that their use case is not data
warehousing.

~~~
azinman2
Correct. I'm not data warehousing. I'm running a parse-like backend out of
Postgres, where I have a lot of duplicated json keys. It would be great to
have compression on the json to include this, while maintaining the other
advantages of the json column store. However as they're hierarchical and there
are many of them columns aren't appropriate (as far as I know but I haven't
benchmarked some crazy dynamic solution... json certainly is easier to work
with).

~~~
greggyb
I would keep an eye on the columnstore space. I know that Microsoft introduced
read-only columnstore indices before introducing updateable ones, and their
turnaround on it was 2 years - SQL Server 2012 to 2014.

I'd use that as the outside end of an estimate for when the Postgres team will
get updateable columnstore working.

~~~
ddorian43
Note, updatetable columnstore sucks everywhere, so I don't think so. It's only
for OLAP, not OLTP (parse is oltp).

What may help @azinman is group-compression (ex: bigger pages of rows and
compress the whole page, where each json-key can be repeated multiple times in
a page since each page may have multiple rows). This is what happens on
tokudb,hbse,hypertable,cassandra.

------
coryfoo
Ostensibly this article is about the wins you can get as far as disk size is
concerned by using cstore_fdw, but I can't help but notice they weren't
creating any indices on this data.

By creating a simple GIN index on the regular jsonb table, you could
dramatically improve those query times. So if your workload doesn't require
that you do a table scan for all records then the tradeoff in terms of disk
space would more than make up for the perf wins provided by the index.

Does anyone know if cstore supports creating indices of any kind, let alone
GIN?

~~~
ddorian43
columnstore is supposed to not work with indices (although they have different
types like zone-maps)

each column is stored separately on disk, so only the requested column-values
are read from disk

This makes slow to select/update/delete a single row(oltp) since it needs to
fetch multiple pages (for each column). And makes it fast to do queries
regarding data in big size (olap) by doing vectorized query execution and
sequential-reads on disk

------
gtrubetskoy
Do you have any resources describing and explaining the internals of
cstore_fdw? Talks, documents, etc - something other than having to read the
code (though I'm fine with that as well). I would very much like to understand
the underlying mechanism of cstore_fdw.

~~~
mslot
The best resource is the github page:
[https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw)

There's also a video from SFPUG last year, though it's not great quality:
[https://www.youtube.com/watch?v=yT34i82o99w](https://www.youtube.com/watch?v=yT34i82o99w)

