
When to Avoid JSONB in a PostgreSQL Schema - matm
http://blog.heapanalytics.com/?p=1115
======
drob
Author here. Curious what experiences y'all have had with JSONB.

We're in the process of switching to a more balanced schema (mentioned in this
post) and the results have been pretty good so far.

Another win has been that the better stats make it possible to reliably get
bitmap joins from the planner. Our configuration uses ~12 RAIDed EBS drives,
so the I/O concurrency is really high and prefetching for a bitmap scan works
particularly well.

~~~
mcdee
I'm using JSONB and the performance downsides are not noticeable in most
cases. For the cases where there are real problems, crafting a custom index
usually fixes the issue.

Using ->> (or ->) in a WHERE clause is generally a bad idea, and certainly
a terrible idea without an explicit index. Use @> instead.
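
For example, a minimal sketch (hypothetical table and index names; the GIN
index is what makes @> cheap):

    -- hypothetical table with a jsonb column and a GIN index on it
    CREATE TABLE events (id bigserial PRIMARY KEY, data jsonb);
    CREATE INDEX events_data_idx ON events USING gin (data);

    -- ->> extracts a value as text; the GIN index can't help here
    SELECT * FROM events WHERE data ->> 'type' = 'click';

    -- @> is a containment test, which the GIN index supports
    SELECT * FROM events WHERE data @> '{"type": "click"}';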

~~~
malisper
Using @> instead of ->> only causes the selectivity estimate of the predicate
to be a different hard-coded estimate. It doesn't fix the underlying problem
of Postgres not keeping statistics on JSONB.

~~~
mcdee
True it doesn't solve the problem of not having statistics on the values, but
it does bring the query response time down to the same order of magnitude as
the non-JSON table.

~~~
malisper
> but it does bring the query response time down to the same order of
> magnitude as the non-JSON table.

In the specific example given it might, but you will still wind up with a
handful of queries that are planned wrong and are orders of magnitude slower.

~~~
mcdee
There are two separate issues. The lack of statistics is one thing, but the
use of ->> instead of @> is another. Look at
[https://explain.depesz.com/s/zJiT](https://explain.depesz.com/s/zJiT) Vs
[https://explain.depesz.com/s/ihwk](https://explain.depesz.com/s/ihwk) for the
difference.

~~~
malisper
Your queries are executing different plans. The first one is executing a
nested loop join which filters out 1,246,035,384 intermediate rows. The second
one is executing an index join which doesn't filter out any intermediate rows
at all. This seems like it was caused either by the scientist_labs_pkey index
not being there in the first trial or just random luck due to a difference in
statistics.

------
tgarma1234
The fact that you can pass attributes for a record into the JSONB field
without defining the table structure in advance is really the decisive feature
because then you never have to bother with changing your data model. For
example, if you have contacts streaming into your table from Android devices
you don't need to say whether or not there should be a column for "work email"
and "home email2" etc etc... you just send everything into a column with
key/value pairs, and you can put whatever key/value pairs you want in that
column, and then query over the keys without inserting a gazillion nulls into
your database for rows that don't have a value for a particular key. You can
also do updates on the json column AND you get all of the benefits of
relational databases with other columns in the same table. I can't really see
how anyone would not love this data type now that I have been exposed to it in
production.
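
A minimal sketch of that pattern (hypothetical names):

    -- fixed columns for what's always there, jsonb for everything else
    CREATE TABLE contacts (
        id     bigserial PRIMARY KEY,
        source text NOT NULL,
        attrs  jsonb NOT NULL DEFAULT '{}'
    );

    -- rows carry only the keys they actually have; no NULL-filled columns
    INSERT INTO contacts (source, attrs) VALUES
        ('android', '{"work email": "a@example.com"}'),
        ('android', '{"home email2": "b@example.com", "nickname": "Bee"}');

    -- query over keys that may or may not exist (? tests key presence)
    SELECT attrs ->> 'work email' FROM contacts WHERE attrs ? 'work email';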

~~~
pjlegato
This is an attractive trap. This mode of data modelling is actually quite
terrible in terms of maintainability. It is precisely the problem with NoSQL
database models.

It doesn't mean there is no data model (schema), and it doesn't mean that the
data model is flexible. It actually means you have a succession of distinct
and undocumented schemas, which are updated on a haphazard, ad hoc basis, with
no documentation or thought given to each change.

Every version of every app and every support program ever written then has to
know about each and every historic variant of the data model that ever
existed. This is a maintenance nightmare when your app is more than a few
iterations old, and when you have several decoupled support systems trying to
use the database.

With an overt schema, you are required to at least think about what you're
doing and to do it in a centralized fashion, rather than slip changes in
haphazardly in any app that ever touches the database, and you're required to
ensure that the data already in the database actually conforms to the new
schema. You won't have one app that puts the work email in "work-email" with a
dash, and another that tries to use "work_email" with an underscore, for
example.

~~~
azinman2
You think database schemas are somehow immune to evolving haphazardly over time?

I've seen otherwise...

~~~
nbraud
The point is not that the DB schema will evolve sanely; nothing can guarantee
that (with or without JSON).

The point is that the schema is explicit: worst case, I can go look it up, and
I /know/ that the data conforms to it.

------
leothekim
"For datasets with many optional values, it is often impractical or impossible
to include each one as a table column."

Honest question - in what settings would many optional values be impractical
or impossible to include as columns? Is it purely space/performance
constraints? If so, it doesn't
sound like JSONB gives you wins in either of those cases.

~~~
chc
Let me give an example: We process some large forms (say 100+ fields). We care
a lot about maybe 15 fields off these forms, but we use the others sometimes.
These forms change every so often, with some fields disappearing and new
fields joining in. The fields we care a lot about don't change, but the ones
we care a little about do. A few possibilities here:

1. You could have a table for every variant of the form

2. You could have a table with every field that's ever been used as a
nullable field and add new columns every time a new version of the form
appears

3. You could have a table with columns for the important fields and a JSONB
column for the variable data

Of the three options, 3 seems the most elegant to me. The other two are
basically allowing the ugliness of the source data to spew complexity all over
your database.
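
A sketch of what option 3 can look like (hypothetical names):

    -- real columns for the stable, frequently queried fields,
    -- jsonb for the variable remainder
    CREATE TABLE form_submissions (
        id           bigserial PRIMARY KEY,
        submitted_at timestamptz NOT NULL,
        form_version text NOT NULL,
        extra        jsonb NOT NULL DEFAULT '{}'
    );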

~~~
cstavish
I use JSONB columns for similar use cases, but to play devil's advocate, you
can accomplish that in a fourth way, which is almost certainly better than 1 or
2. A table for fields, one row per field. A table for forms, which has a many-
to-many relation to fields. Entries in a link table compose a form of
arbitrary fields. Answers can be stored in a separate responses table, indexed
by form_id and column_id. I don't know enough about database implementation to
speculate on how that would perform at scale, but conceptually that's how I
think of the problem.

~~~
malisper
This is roughly EAV. EAV is where you have a table with the schema:

    entity_id | attribute_id | value

EAV is typically considered to be an anti-pattern for several reasons: it
becomes very expensive to rematerialize an entity with all of its
attributes/values, it becomes difficult to have any kind of constraints on the
values (this is also a problem with jsonb), and it's hard for the database to
maintain good statistics on the values of different attributes as most
databases don't keep multi-column statistics. Don't worry, I've had similar
ideas before.
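
For instance, rematerializing an entity costs roughly one self-join per
attribute (a sketch, hypothetical names):

    CREATE TABLE eav (
        entity_id    bigint,
        attribute_id text,
        value        text
    );

    -- reassembling just two attributes of one entity already needs
    -- a self-join; it gets worse with every additional attribute
    SELECT a.value AS name, b.value AS email
    FROM eav a
    JOIN eav b ON b.entity_id = a.entity_id AND b.attribute_id = 'email'
    WHERE a.entity_id = 42 AND a.attribute_id = 'name';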

~~~
strommen
EAV is an anti-pattern?

If you don't have JSONB, EAV is the only remotely-reasonable way to implement
user-defined fields (e.g. product-specific attributes in e-commerce).

~~~
malisper
If you want to allow user-defined fields in a relational database, your
realistic options are either EAV or to stuff json into a text column. EAV, if done
extremely carefully, can be a good solution, but 99% of the time, it's going
to be a huge pain.

~~~
approachingtraj
> stuff json into a text column

You did see the article was about JSONB, which is significantly more
sophisticated than "json in a text column", yes?

~~~
malisper
Yes, the person I was replying to was asking about what to do if you don't
have jsonb.

------
simiano
Keep in mind that Heap collects a huge amount of data and has a huge dataset.
I like this talk[1]; it gives you an overview of their architecture. Great
post though, thank you for sharing.

[1]
[https://www.youtube.com/watch?v=NVl9_6J1G60](https://www.youtube.com/watch?v=NVl9_6J1G60)

------
IsmaOlvey
One thing I would add to the list: Don't use it for data that you need to
change. There are (at least at present) no built-in functions to modify values
within JSON(B) objects, which makes it very tedious to modify data once it has
been stored.

It is much better for data that is stored once and then queried.

------
vog
_> It has no way of knowing, for example, that record ->> 'value_2' = 0 will
be true about 50% of the time_

Can't this be solved by introducing an expression index[1] for "record ->>
'value_2'"?

This would add a specialized index that will be used for all queries that have
a filter like "WHERE record ->> 'value_2' = 0 AND ...".

[1] [https://www.postgresql.org/docs/current/static/indexes-
expre...](https://www.postgresql.org/docs/current/static/indexes-
expressional.html)

~~~
drob
The expression index will make it fast to retrieve the rows for which that
predicate is true, but it won't help the planner know that this will be the
case for 50% of rows, so I don't think it will change the join that the
planner selects (which is the problem here).

In fact, this might make the query slower. If Postgres thinks it is selecting
a very small number of rows, it will prefer an index scan of some kind, but a
full table scan will be faster if it's retrieving 1/8th of the table (at
least, for small rows like these). So, you might get a slower row retrieval
and the same explosively slow join.

~~~
parenthephobia
Postgres stores statistics for expression indexes, so it can know that the
predicate is true for half the rows.

In the worked example, adding expression indices for the integer values of
value_1, value_2, and value_3 makes the JSONB solution only marginally less
efficient than the full-column solution. On my computer, ~300ms instead of
~200ms.

(This is Postgres 9.5)

~~~
malisper
A while ago I came across this thread[0] in which Tom Lane brings up the fact
that statistics are kept on functional indexes. I can't remember why, but for
some reason I couldn't get the planner to do what I specifically wanted. It
may have been a weird detail about composite types.

Separately, the big downside I see with this approach is that it requires
indexes on every field you would ever query by. If we were to create
expression indexes on each field in the jsonb blob, that would effectively
double the amount of space being used as well as dramatically increase the
write cost.

[0] [https://www.postgresql.org/message-
id/6668.1351105908%40sss....](https://www.postgresql.org/message-
id/6668.1351105908%40sss.pgh.pa.us)

------
warmwaffles
Gee go figure that using native columns is faster and better to query over.

------
buremba
In fact, JSONB is probably not a good idea when it comes to analytics.

The storage is almost 2x, accessing attributes is more expensive than with
tabular data even though JSONB is indexed, the data can be dirty (the client
can send extra attributes or invalid values for existing attributes; since
JSONB doesn't have any schema, PostgreSQL doesn't validate or cast the values)
and, as the author mentioned, PostgreSQL statistics and indexes don't play
nicely with JSONB.

------
sgt
Most of our tables have a uuid, a jsonb entity, and relationships stored as
uuid columns with a _fk suffix. We then index the FKs.
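
A sketch of that layout (hypothetical names):

    CREATE TABLE orders (
        id          uuid PRIMARY KEY,
        customer_fk uuid NOT NULL,
        entity      jsonb NOT NULL
    );

    -- index the FK columns used for joins
    CREATE INDEX ON orders (customer_fk);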

------
silverlight
Does adding a GIN index to the JSONB help this?

~~~
malisper
A GIN index only helps with querying the data from the table. It won't help
with making the proper join choice or with getting a bitmap scan between
multiple indexes. Additionally, you are unable to query numeric values by an
inequality with a GIN index.
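
For example (a sketch, reusing a hypothetical jsonb column named data):

    -- containment: a GIN index can serve this
    SELECT * FROM events WHERE data @> '{"value_2": 0}';

    -- inequality: a GIN index cannot serve this
    SELECT * FROM events WHERE (data ->> 'value_2')::int > 10;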

~~~
throwa
Does using a UNION or UNION ALL query instead of a join query reduce the
impact of the jsonb column not having statistics, especially since the
PostgreSQL query planner might not use the nested loop join?

------
manigandham
Why don't document databases automatically save common keys in some kind of
lookup table?

Seems like a basic feature to improve space savings and processing speed.

~~~
hendzen
It's much simpler to just use off-the-shelf compression (zlib, lz4, etc.) on
the database pages. This basically has the same effect, but also compresses
common values.

~~~
manigandham
In that case, why isn't that normal and key name size a non-issue with document
stores and JSON columns?

Am I missing something on why this isn't done already and automatically?

~~~
hendzen
If you use MongoDB with the WiredTiger engine, the database pages are
compressed with Snappy by default. On other systems you have the option of
running the database on a filesystem that provides compression.

------
palmdeezy
Hey SQL newbie question: why use JSONB when you could split out tables into
`user` and `user_meta`? Isn't that how WordPress works?

~~~
bruce_one
It all depends on what you're doing...

But, we had a situation where we had a "user_meta" equivalent but wanted to
support different data types (and even possibly nested data). Using JSONB
allowed for simple modelling of something like `{ "age": 1, "school": "blah",
"something": { "in": "depth" } }`, which isn't as simple using an extra "meta"
table.

(Not to say it's the best thing to do (depending on the situation it might be
better to have explicit columns for those fields) but it's an example of how
it can be more powerful than just having an extra "meta" table.)

~~~
palmdeezy
Ahhhh so this will be super useful if I want to keep track of transaction
events from a third party (like Shopify or Stripe). I could just keep a table
with `ID` `user_id` `time` `blob`.
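
Something like this, I suppose (hypothetical names):

    CREATE TABLE payment_events (
        id      bigserial PRIMARY KEY,
        user_id bigint NOT NULL,
        time    timestamptz NOT NULL,
        blob    jsonb NOT NULL
    );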

------
rhinoceraptor
Using jsonb also brings the headache of having to worry about the version of
Postgres you're using. Simple functionality like updating an object property
in place might be missing in your version. And the documentation and stack-
overflow-ability of json/jsonb is not very good yet.
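
(For instance, jsonb_set only appeared in 9.5; older versions have no built-in
equivalent. A sketch, assuming a hypothetical events table:)

    -- 9.5+ only: replace one property inside the jsonb document
    UPDATE events
    SET data = jsonb_set(data, '{type}', '"tap"')
    WHERE id = 1;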

But as an alternative to things like serialized objects, I think it's
definitely a huge win. You can do things like join a jsonb object property to
its parent table, which wouldn't be possible with serialized objects.

~~~
pgaddict
And why is this a headache? Every time a new feature is introduced, you have
to worry about the PostgreSQL version.

If you're developing an application in-house, this is not a big deal - you can
make sure you have the right PostgreSQL version. If you're hosting the
application on a shared database server, well, you're exactly in the same
situation as with other software products.

------
edoceo
I used JSONB to keep the Tags on an object. Used to be EAV. Works way better
here.

------
crorella
I usually use JSONB but keep a non-jsonb column as the PK.

------
knucklesandwich
Definitely have been bitten with the query statistics issue before. I worked
with a colleague once who was adamant that we build our backend on MongoDB,
but I was able to convince him to build on Postgres because of its JSONB
support. I don't get why: schema updates are generally very cheap with
databases like Postgres (adding columns without a default or deleting columns
is basically just a metadata change), but some developers believe it's worth
the headache of going schema-less to avoid migrations. In a sense, that
suggestion kind of bit me in the ass when we started having some painfully
slow report generation queries that should have been using indexes, but were
doing table scans because of the lack of table statistics. In a much larger
sense, I'm still thankful we never used MongoDB.

Protip: Use the planner config settings[1] (one of which is mentioned in this
article) with SET LOCAL in a transaction if you're really sure the query
planner is giving you guff. On more structured data that Postgres can
calculate statistics on, let it do its magic.

[1]: [https://www.postgresql.org/docs/current/static/runtime-
confi...](https://www.postgresql.org/docs/current/static/runtime-config-
query.html)

~~~
pmontra
You need some migrations anyway or you'll get cruft in the db or worse. Think
of old documents with extra fields your software doesn't use anymore, or
missing fields that are needed. Multiply by embedded documents and you get a
ton of problems you can only solve by taking care of the data.

This happens even in development before going live for the first time, and way
more often as you keep changing the software. Even if you throw away the data every time
you still have to update the seeding scripts (in a relational db you have
seeding + schema changes).

Anyway, what did you do? Did you keep using JSONB with that planner config
setting or did you extract some data to ordinary columns?

~~~
knucklesandwich
Totally agree. At a previous job, some of the senior engineers decided to use
MongoDB as the main data store, and doing migrations was among the worst
things about it. I think some engineers envision that they'll just be able to
do read repair and things will magically work. In practice, you can only
really do read repair when you have a workload oriented to reading single
records at a time and you have strict controls on concurrent access to prevent
weird A-B-A errors with read repair. Complex aggregate queries are almost
always impossible with read repair. Even with single records, read repair is
still a pain in the ass. You often have to maintain unmarshalling code for
several versions of a record format. In the end, one of the engineers ended
up having to write some internal migration tool (which was of course strictly
worse than migrations via Postgres, because schema changes _did_ require
rewriting a table with update queries, so we ended up needing a bit of
downtime). Even with the migration tool, shipping always required a lot of
people on call, since migrations would inevitably break during the release
process due to frequently brittle migration code.

As for the above story, that engineer was sort of on his way out at the time,
so I used the above method to provide query hints as a short term fix. After
he left, I was able to restructure the event data schema to make more use of
columns. Some of the ancillary attributes that weren't used for row selection
stayed as jsonb, but things like timestamp, event name, user id, etc. were
moved to columns.

