
JSON with Sqlite - ausjke
https://www.sqlite.org/json1.html
======
fforflo
I've been using the JSON1 extension for some time now (in production projects)
and it's truly remarkable. I usually just dump the JSON-response data from an
API to a "raw_data" table (typically one "updated_at" column and a second
"json_data" one).

At that point you could normalize your schema, but only if you really have
to! You can get away with a NoSQL-like denormalized schema, performance-wise,
by carefully defining indexes on expressions. And you can partially normalize
it with views (SQLite doesn't support materialized views).
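
A minimal sketch of that setup (table, column, and path names here are made
up; it assumes a build with JSON1 and expression-index support, i.e. SQLite
3.9+):

    import sqlite3
    
    con = sqlite3.connect("api_cache.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS raw_data (
        updated_at TEXT,
        json_data  TEXT
    );
    -- expression index: lookups on this JSON field avoid a full-table scan
    CREATE INDEX IF NOT EXISTS idx_raw_user_id
        ON raw_data (json_extract(json_data, '$.user.id'));
    -- "normalize" with a view that exposes JSON fields as ordinary columns
    CREATE VIEW IF NOT EXISTS users AS
        SELECT json_extract(json_data, '$.user.id')   AS user_id,
               json_extract(json_data, '$.user.name') AS user_name,
               updated_at
        FROM raw_data;
    """)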

And of course it's almost always faster to query data where it lives (via
SQL) instead of fetching it from disk and querying it in, say, Python; you pay
too much IO cost otherwise. (Yes, my dear aspiring data scientist, do not load
everything into a huge DataFrame, go learn yourself some SQL :-) )
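
For instance, sticking with the hypothetical raw_data table above, an
aggregation pushed down into SQLite ships only the grouped totals into Python
instead of every raw row:

    import sqlite3
    
    con = sqlite3.connect("api_cache.db")
    # Aggregate inside the engine; only the small result set crosses
    # the boundary into Python.
    rows = con.execute("""
        SELECT json_extract(json_data, '$.user.country') AS country,
               COUNT(*) AS n
        FROM raw_data
        GROUP BY country
        ORDER BY n DESC
    """).fetchall()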

The JSON dump is stored as text rather than binary, but honestly I haven't
seen this be a problem, plus you can easily run queries at the CLI and pipe
the output to jq, sed etc.

If your application is data-warehouse-like and read-heavy (for example an
internal reporting dashboard), I can't see any reason why you should pay the
cost of setting up a Postgres or MongoDB instance (although I do love both).

It is true that SQLite does not support concurrent writes, but (and that's a
big BUT) if you carefully open connections only when you need them and use
prepared statements, I can't see how you could run into problems on modern
SSD hardware (unless you're Google-scale, of course).

~~~
stuxnet79
> (Yes, my dear aspiring data scientist, do not load everything into a huge
> DataFrame, go learn yourself some SQL :-) )

This hits a little bit too close to home for me. I am quite proficient at
writing performant SQL queries and recently started using Pandas. I find the
data frame abstraction better for certain data manipulation tasks. Assuming
there is enough RAM available, is it still better to offload everything to
the database engine?

~~~
fforflo
As usual: it depends... If you're doing prototyping and working in a Jupyter
notebook, sure, go ahead and work at the Pandas level. Once you're done with
prototyping, however, and have settled on a "final_df" (I bet you have
something like that in your last notebook cells), maybe you should think about
transforming some of the "columns" into SQL queries, which are VCS-able,
sometimes faster, and, most importantly, usable by other people too. Instead
of 10 people loading 10 different DFs, you can have 10 people querying the
same table/view.
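
A sketch of that idea with hypothetical names: the logic behind "final_df"
lives in the database once, as a VCS-able view, and everyone reads the same
thing:

    import sqlite3
    import pandas as pd
    
    con = sqlite3.connect("api_cache.db")
    # The derivation is now plain SQL instead of notebook-only code.
    con.execute("""
        CREATE VIEW IF NOT EXISTS user_activity AS
            SELECT json_extract(json_data, '$.user.id') AS user_id,
                   COUNT(*)                             AS events,
                   MAX(updated_at)                      AS last_seen
            FROM raw_data
            GROUP BY user_id
    """)
    final_df = pd.read_sql_query("SELECT * FROM user_activity", con)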

------
rgovostes
I recently wrote an SQLite extension that lets me query stored Protobuf
messages. I was inspired by the JSON extension. Adding virtual tables to
SQLite with an extension is tricky but kind of magical when you finally have
the full query language to play with.

[https://github.com/rgov/sqlite_protobuf](https://github.com/rgov/sqlite_protobuf)

~~~
tylerl
Do you have performance numbers, especially in comparison to json1? Wondering
how much benefit you get from a binary encoding scheme.

~~~
rgovostes
I don’t, sorry. I did not build this with performance in mind, but for
querying packet captures of a protocol built on Protobuf.

Any query is going to require deserializing every row in the table (JSON and
Protobuf alike), which is basically a non-starter for any project with
performance requirements. And I believe that Protobuf messages are decoded as
a whole, rather than just the desired field, which is going to be slower.

~~~
scottlamb
Interesting library, thanks!

> Any query is going to require deserializing every row in the table (JSON and
> Protobuf alike), which is basically a non-starter for any project with
> performance requirements.

Apparently you can create indexes on the json functions, so I assume on your
proto ones too.

> I believe that Protobuf messages are decoded as a whole, rather than just
> the desired field, which is going to be slower.

With the official library, yes, but it's an implementation/API choice. The
wire format [1] can be skimmed fairly efficiently. Most significantly,
message/bytes/string fields are all written as tag number, length, data. So if
there are submessages (entire trees) you don't care about, you can just skip
over them. I've seen custom proto-decoding code do this.
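
A rough Python sketch of that skimming idea (this is not rgov's extension
code; it handles only the varint and length-delimited wire types, which is
enough to hop over submessages without decoding them):

    def read_varint(buf, i):
        """Decode a base-128 varint starting at buf[i]; return (value, next_i)."""
        result, shift = 0, 0
        while True:
            b = buf[i]
            i += 1
            result |= (b & 0x7F) << shift
            if not b & 0x80:
                return result, i
            shift += 7
    
    def find_field(buf, wanted_tag):
        """Scan one message, skipping whole fields until wanted_tag is found."""
        i = 0
        while i < len(buf):
            key, i = read_varint(buf, i)
            tag, wire_type = key >> 3, key & 7
            if wire_type == 0:    # varint
                value, i = read_varint(buf, i)
            elif wire_type == 2:  # length-delimited: string/bytes/submessage
                length, i = read_varint(buf, i)
                value, i = buf[i:i + length], i + length
            else:                 # fixed32/fixed64 omitted in this sketch
                raise ValueError("unhandled wire type %d" % wire_type)
            if tag == wanted_tag:
                return value
        return None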

I think the most efficient approach would be to parse the path once per SQL
query into a tag-number-based path, and then have the per-row function just
follow those tags with custom decoding logic. Unfortunately, from my quick
skim, it looks like SQLite's extension API doesn't really support this. You'd
want it to build some context object for a given path/proto, call extract with
it a bunch of times, then tear it down. Still, I suppose you could keep an LRU
cache of these or something.

btw, I see your code is compiling a regex [edit: originally wrote proto by
mistake] in the per-row path. [2] I haven't profiled, but I'd bet that's
slowing you down a fair bit. You could just make it a 'static const
std::regex* kPathElementRegexp = new std::regex("...")' to avoid this. (static
initialization is thread-safe in C++. The heap allocation is because it's good
practice to ensure non-POD globals are never destructed. Alternatively,
there's absl::NoDestructor for this.)

[1] [https://developers.google.com/protocol-buffers/docs/encoding](https://developers.google.com/protocol-buffers/docs/encoding)

[2]
[https://github.com/rgov/sqlite_protobuf/blob/0a148ac6a5a2c02...](https://github.com/rgov/sqlite_protobuf/blob/0a148ac6a5a2c02ec4f9a80cad5f4ee51175ff5a/src/protobuf_extract.cpp#L113)

~~~
rgovostes
You mean it's recompiling the regex on every pass? Ah, yeah, I hadn't thought
about that. Thanks for the tip.

------
punnerud
To check if you already have it installed, run "pragma compile_options;" and
look for ENABLE_JSON1. On my Mac I had it available both in the sqlite3 CLI
and in Python :-)

    
    
      import sqlite3
      
      con = sqlite3.connect(':memory:')
      con.enable_load_extension(True)  # only needed when loading json1 as a runtime extension
      options = con.execute("pragma compile_options;")
      for option in options:
          print(option)  # look for ENABLE_JSON1
    

The newest releases of sqlite3 have it included. If not, you can build it
this way: [https://burrows.svbtle.com/build-sqlite-json1-extension-as-s...](https://burrows.svbtle.com/build-sqlite-json1-extension-as-shared-library-on-os-x)
Note that you will only get this error on load if you already have it built
in: 'Error: error during initialization:'

------
toomim
What are some useful use-cases for this?

I can already use JSON in SQLite by doing the parsing in the client, outside
of the SQLite API. Any examples where I'd want to use this instead? I'm
guessing for WHERE clauses in queries, perhaps?

Can I create an index on a field within a JSON tuple?

~~~
kokokokoko
What I have found with any relational database that has this kind of JSON
storage option (MySQL, PostgreSQL) is that you can take data that arrives as
JSON from a third party and just dump it straight into the DB as JSON, as
opposed to having to create a schema for it.

It's nice to have the option for the data to still be queryable without
having to make it a first-class schema with all the setup involved.

It isn't for first-class data that is used frequently. But when you might only
need the data on occasion, it's nice to have, productivity-wise.
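
For example (made-up payloads; assumes a sqlite3 build with JSON1 available),
the dump-and-query workflow looks like this:

    import sqlite3
    
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE webhook_events (payload TEXT)")
    con.execute("INSERT INTO webhook_events VALUES (?)",
                ('{"type": "invoice.paid", "amount": 42}',))
    # No schema designed up front, yet the data is still queryable.
    rows = con.execute("""
        SELECT json_extract(payload, '$.amount')
        FROM webhook_events
        WHERE json_extract(payload, '$.type') = 'invoice.paid'
    """).fetchall()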

~~~
remus
One pattern that I've used when parsing large json responses is to have a
table schema like

    
    
        frequently|used|data|in|these|cols|full_json_response
    

So you pull out the fields you need frequently but keep the full JSON, so you
can dig into the details of the response on an ad-hoc basis. It's not very
efficient, as you can end up storing a big blob of JSON, but you're not
throwing away any data when you parse the response, and your schema stays very
manageable. It also gives you the flexibility to add new fields to the schema
later by pulling the data out of the JSON.
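
A minimal sketch of the pattern with hypothetical names: hot fields get real
columns (pulled out with json_extract() at insert time), and the raw response
rides along for later digging:

    import sqlite3
    
    con = sqlite3.connect("api_cache.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS responses (
            order_id           TEXT,
            status             TEXT,
            full_json_response TEXT
        )""")
    
    def store(raw_json):
        # Extract the hot fields once; keep the blob whole.
        con.execute("""
            INSERT INTO responses
            SELECT json_extract(?1, '$.order.id'),
                   json_extract(?1, '$.order.status'),
                   ?1
        """, (raw_json,))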

~~~
oefrha
I use this pattern a lot, too. Just make sure not to `SELECT *`, which can be
much slower and waste a lot of RAM in some cases (ORMs using e.g. the active
record pattern are often susceptible to this when used naively).
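
E.g., with the hypothetical responses table sketched above, naming the columns
keeps the big blob out of the result set:

    import sqlite3
    
    con = sqlite3.connect("api_cache.db")
    # Only the named columns travel; full_json_response stays on disk.
    rows = con.execute(
        "SELECT order_id, status FROM responses WHERE status = ?",
        ("failed",),
    ).fetchall()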

------
willvarfar
From a glance, this looks like it's compatible with MySQL's JSON functions.

It isn't compatible, for example, with MariaDB's dynamic columns.

I'm not very familiar with the PostgreSQL json functionality, but I think
that's subtly different again.

When I use local DBs for testing my JSON code I've used Derby and defined the
json_extract() functions etc. myself just to test. With SQLite having
compatible functions, people wanting to test MySQL stuff locally will be able
to just point their code at an SQLite DB instead. Great!

The bit that seems to be missing is the shorthand for selecting column values;
in MySQL, instead of doing SELECT JSON_EXTRACT(col, "$.this.is.ugly[12]"), ...
you can just do SELECT col->"$.this.is.ugly[12]", ...

Now what I want to be able to write is SELECT col.this.is.nicer[12], ....

I think there is some 'standard' somewhere that MySQL - and now SQLite - is
implementing? The functions and the 'path' syntax are standardized (although
with only MySQL and now SQLite supporting them it's perhaps not a big deal). I
just can't find any reference to that standard in the MySQL docs, nor in this
SQLite doc.

Personally, I dislike the path syntax though! Every time I see an SQL snippet
with string paths full of dollar signs it offends my retinas.

~~~
tannhaeuser
According to [1], ISO/IEC 9075:2016 specifies JSON functionality in ISO SQL,
though I don't have access to it / didn't buy the standard text. I also don't
know whether Postgres etc. follows the standard, as Postgres' implementation
predates it, but I would expect SQLite to.

[1]: [https://modern-sql.com/blog/2017-06/whats-new-in-sql-2016](https://modern-sql.com/blog/2017-06/whats-new-in-sql-2016)

~~~
masklinn
Postgres 11 doesn't follow the standard at all (as per your link). A third
party implemented the standard [0], which was submitted on pghackers [1] very
late in the Postgres 11 cycle [2] (at 75k gzipped, the patch is pretty
humongous). A quick googling doesn't show any information about it being
integrated (in part or in full) into pg12, so IDK.

edit: apparently
[https://commitfest.postgresql.org/17/1471/](https://commitfest.postgresql.org/17/1471/)
has been "Waiting on Author" and bumped from CF to CF since early 2018, and
[https://commitfest.postgresql.org/17/1472/](https://commitfest.postgresql.org/17/1472/)
and
[https://commitfest.postgresql.org/17/1473/](https://commitfest.postgresql.org/17/1473/)
are pretty much the same, though without the "Waiting on Author"; but I don't
really know how CF works, and I see no comments, requests, or reviews, so…

[0]
[https://obartunov.livejournal.com/200076.html](https://obartunov.livejournal.com/200076.html)

[1] [https://www.postgresql.org/message-id/CAF4Au4w2x-5LTnN_bxky-...](https://www.postgresql.org/message-id/CAF4Au4w2x-5LTnN_bxky-mq4=WOqsGsxSpENCzHRAzSnEd8+WQ@mail.gmail.com)

[2] [https://www.postgresql.org/message-id/00531c7e-f501-b852-9b6...](https://www.postgresql.org/message-id/00531c7e-f501-b852-9b67-1d1278d035a0%40pgmasters.net)

~~~
anarazel
They're actively working on it. There have been new revisions posted over the
last few days, responding to review from a week or two ago.

[https://www.postgresql.org/message-id/a3be6a7a-77d3-0e88-4f9...](https://www.postgresql.org/message-id/a3be6a7a-77d3-0e88-4f9d-4f725d11d7cd%40postgrespro.ru)
[https://www.postgresql.org/message-id/c2f32c9f-9a69-202b-a8a...](https://www.postgresql.org/message-id/c2f32c9f-9a69-202b-a8aa-d93c769a579e%40postgrespro.ru)

> I see no comment or requests or reviews so…

That happens on the mailing list...

------
staticelf
SQLite is really an awesome database; this seems great for storing arbitrary
JSON data!

------
Sean1708
> Backwards compatibility constraints mean that SQLite is only able to store
> values that are NULL, integers, floating-point numbers, text, and BLOBs. It
> is not possible to add a sixth "JSON" type.

I'd be interested to know what these constraints are. Does SQLite guarantee
that files created with newer SQLite versions are still compatible with older
SQLite versions?

~~~
tarruda
From
[https://www.sqlite.org/onefile.html](https://www.sqlite.org/onefile.html) :

> The SQLite database file format is also stable. All releases of SQLite
> version 3 can read and write database files created by the very first SQLite
> 3 release (version 3.0.0) going back to 2004-06-18. This is "backwards
> compatibility". The developers promise to maintain backwards compatibility
> of the database file format for all future releases of SQLite 3. "Forwards
> compatibility" means that older releases of SQLite can also read and write
> databases created by newer releases. SQLite is usually, but not completely,
> forwards compatible.

~~~
kierenj
And yet, "The json1 extension uses the sqlite3_value_subtype() and
sqlite3_result_subtype() interfaces that were introduced with SQLite version
3.9.0 (2015-10-14). The json1 extension will not work in earlier versions of
SQLite."

~~~
matharmin
This just means you can't use the json1 extension with an old version of the
library. But anything you write using it will still be readable with an older
version.

Of course, if you do something like creating an index using a json1
expression, that will cause issues if you try to use the database with an
older version.

------
qwerty456127
This should be built into standard SQLite.

~~~
tracker1
SQLite is used in a _LOT_ of environments, including low-overhead embedded
systems, which makes it impractical to add the overhead there. As for desktop
and mobile, I absolutely agree: it should be the default in a lot of places.

------
eli
Is loading extensions at runtime new? Coulda sworn I used to have to recompile
for this.

~~~
pstuart
It seems valuable enough to have in the standard build. That and the CSV
extension as well.

~~~
pjtr
Especially for things like Python's embedded SQLite. Packaging and loading an
extension module cross-platform with an otherwise pure Python script seems
difficult.

------
qwertox
The JSON1 extension can be used in Android with the sqlite-android library:

[https://github.com/requery/sqlite-android](https://github.com/requery/sqlite-android)

------
paulddraper
One of the nicest things you can do with this is get hierarchical results.
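
For example (a made-up authors/books schema), json_object() and
json_group_array() can fold one-to-many rows into a single nested document;
the json() wrapper marks the subquery result as JSON rather than a plain
string:

    import json
    import sqlite3
    
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE books (author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO books VALUES (1, 'A1'), (1, 'A2'), (2, 'B1');
    """)
    (doc,) = con.execute("""
        SELECT json_group_array(
                   json_object(
                       'name',  a.name,
                       'books', json((SELECT json_group_array(b.title)
                                      FROM books b
                                      WHERE b.author_id = a.id))))
        FROM authors a
    """).fetchone()
    print(json.loads(doc))
    # [{'name': 'Ann', 'books': ['A1', 'A2']}, {'name': 'Bob', 'books': ['B1']}]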

------
IloveHN84
Why not convert JSON from/to a BLOB?

~~~
robocat
From the article:

> The json1 extension does not (currently) support a binary encoding of JSON.
> Experiments have been unable to find a binary encoding that is significantly
> smaller or faster than a plain text encoding. (The present implementation
> parses JSON text at over 300 MB/s.) All json1 functions currently throw an
> error if any of their arguments are BLOBs because BLOBs are reserved for a
> future enhancement in which BLOBs will store the binary encoding for JSON.

~~~
ricardobeat
Seems like it has been updated to say 1GB/s now.

