
Run SQL on JSON files without any data loads - mihar
http://citusdata.com/blog/65-run-sql-on-json-files-without-any-data-loads
======
ot
A few years ago I was toying with querying JSON documents without loading
them into a database, and came up with the idea of _caching_ the parse tree
of the documents after the first query, so that subsequent queries would run
much faster. I called the technique _semi-indexing_ and wrote a paper [1]
about it; on my synthetic tests the speedups were significant (as much as
10x), but I never got the chance to test it on real workloads.

I wonder if it would be useful to integrate the semi-index code [2] into
this json_fdw. Thanks for sharing the code; I'll try to see how feasible
this is.

[1] <http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf>

[2] <https://github.com/ot/semi_index>

~~~
mmastrac
This paper is fascinating. I have an extremely fast JSON parser for Java that
I've been working on, on and off [1], and I'd love to optionally augment it
with semi-indexes for cases where end-users are using a subset of a file.

[1] <https://github.com/mmastrac/nanojson>

~~~
ot
Very interesting! Implementing a semi-index is very easy if you have
implementations of Elias-Fano and balanced parentheses data structures.

For Java I recommend Sux4j [1], which has a very good implementation of
Elias-Fano, but its balanced-parentheses structure is still fairly
primitive. I was told by the author that a better data structure, based on
range min-max trees, should be added soon.

[1] <http://sux.dsi.unimi.it/>

------
mortov
I remember when it first caught my eye as Postgres95, and I've preferred it
ever since, even through the huge surge toward MySQL that left PostgreSQL
looking like an also-ran for quite a while.

The reality is it is one of the most amazing open source projects I know -
enough to rival Linux itself for me.

These little snippets of PostgreSQL features that keep being posted give
just a tiny hint of the truly enterprise-grade facilities it has. I just
wish I still had as much time to keep on top of it as I once did, but it
remains my go-to database for project development. I've even dropped it in
to replace full-blown Oracle systems for serious cost savings.

The developers should be receiving far more kudos than they do.

------
d4nt
I don't make a habit of spamming discussion threads but... my product
QueryTree (<http://querytreeapp.com>) will load JSON files even if they have
lists at the root; you can use the Append tool to bring multiple files
together, and then use Filter, Sort, Join and Group without ever having to
define a schema.

Admittedly, it's aimed at novice users so if you're comfortable with SQL it
may feel limiting.

~~~
MartinCron
It's not spamming if it's relevant.

------
knowtheory
I am mystified as to when this tool would be useful (can anybody else think of
a practical use case?).

You can run SQL queries on individual JSON files... but you have to have
PostgreSQL installed already.

And you can't run it against multiple JSON files, only one, with a single JSON
object at the root.

~~~
ozgune
(Ozgun at Citus Data)

You're right that you need to have Postgres installed. As for running SQL
over multiple JSON files, we wanted to keep the blog post short, so we noted
several ways to do that in our GitHub Readme.

1\. You can create a partitioned PostgreSQL table and declare one child
table per JSON file. You can also declare constraints on the child tables so
that files irrelevant to a query get filtered out (a sketch of this setup
follows below).

2\. You can create a distributed foreign table on CitusDB, even if you only
want to run on one node. In this case, we'll collect statistics
automatically for you and filter out irrelevant files.

3\. If neither of these fits, you could change the source code to scan over
a directory instead of a single file. We didn't go down this path so that we
could stay compatible with file_fdw's semantics.

~~~
neilk
Okay, so if I understand correctly, the benefit here is that I can run some
kinds of SQL queries on a large collection of heterogeneous JSON documents.
Is that right? You're still speaking Postgres-internals geek, not
I-want-to-get-something-done geek.

That is:

\- you can take heterogeneous JSON documents, and then make them appear to
the Postgres world as separate tables. (Which they have to be, because
they're not uniform.)

\- you can then take some kinds of SQL queries and distribute them among these
tables, as if they were one big SQL table. If some pseudo-tables have missing
or extra columns, no problem.

~~~
ozgune
Thanks for the comment. Yes, that's exactly the benefit.

You could in fact have heterogeneous JSON documents within one file, and
json_fdw could still handle them. If a field you declared in the table
schema doesn't appear in a JSON document, json_fdw would just consider that
field to be NULL.
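
For example, a minimal sketch (the table, column, and file names here are
made up for illustration):

    -- Illustrative names only; assumes the json_server defined earlier.
    CREATE FOREIGN TABLE reviews (
        product_id text,
        rating     int,
        helpful    int   -- absent from some of the documents
    ) SERVER json_server OPTIONS (filename '/data/reviews.json');

    -- Documents without a "helpful" field simply read back as NULL.
    SELECT count(*) FROM reviews WHERE helpful IS NULL;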

The issue is, json_fdw is built such that one table can only be associated
with one JSON file, and users typically have many JSON files lying around.
For example, you'd have one JSON file for each hour's worth of website error
logs.

If that's the case, you could create a distributed table. That way, you
could run the query on one distributed table and have it run over all your
JSON files instead of just one.

That said, it's probably best to associate one type of JSON log (say, your
website error logs) with one distributed table. That way, your queries go
over your website error logs or your mobile application logs, but not both
at the same time.

------
ajtaylor
With the upcoming release of Pg 9.3, I've been reading a lot more posts and
articles about the FDW (foreign data wrapper) capabilities. I had not
considered all the things FDW can do, but I'm getting more and more excited
the further I get into it.

For instance, at $work we have a bunch of tables that are archives of raw
click data. They're taking up quite a bit of space on disk, which we'd like
to reclaim. Keeping the raw data around is good, since I want to create a
summary table and you can't recreate raw data from summaries. The idea was
to export the table data as CSV, and then reload it whenever I had time to
mess around with summaries. With FDW I can leave the data as CSV and read it
just as if it were in a proper DB table. Win!
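
With the stock file_fdw contrib module, that looks roughly like this (the
table and path names are made up):

    -- Illustrative names; file_fdw ships with PostgreSQL as a contrib
    -- module and reads the CSV in place.
    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_server FOREIGN DATA WRAPPER file_fdw;

    CREATE FOREIGN TABLE raw_clicks (
        click_time timestamptz,
        user_id    bigint,
        url        text
    ) SERVER csv_server
      OPTIONS (filename '/archive/raw_clicks.csv', format 'csv', header 'true');

    -- Queryable like any table, straight off the flat file.
    SELECT date_trunc('day', click_time), count(*)
    FROM raw_clicks GROUP BY 1;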

Or even better: again at $work we use MySQL, but I hate it because SQL
strict mode isn't enabled and MySQL does really stupid things with my data
in "normal" mode. I can't safely turn strict mode on at this point because I
don't have the time to test everything. I also really like Pg and would love
to switch, but again, I don't have the time to do it. What I've been
thinking about is using the FDW capabilities (in particular the upcoming
writable FDW support in 9.3) as a way to safely and slowly migrate DB
platforms. It's only an idea in my head right now, but it's an intriguing
one.
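
A rough sketch of what that bridge might look like with the third-party
mysql_fdw module (the host, credentials, and table names are placeholders,
and the option names assume mysql_fdw's usual interface):

    -- Hypothetical setup; everything here is illustrative.
    CREATE EXTENSION mysql_fdw;
    CREATE SERVER legacy_mysql FOREIGN DATA WRAPPER mysql_fdw
        OPTIONS (host '127.0.0.1', port '3306');
    CREATE USER MAPPING FOR CURRENT_USER SERVER legacy_mysql
        OPTIONS (username 'app', password 'secret');

    CREATE FOREIGN TABLE legacy_orders (id bigint, total numeric)
        SERVER legacy_mysql OPTIONS (dbname 'shop', table_name 'orders');

    -- With a writable FDW, Postgres can read and write the MySQL side
    -- while tables migrate over one at a time.
    INSERT INTO legacy_orders VALUES (1, 9.99);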

------
languagehacker
I kind of think the more common use case is accessing JSON data _without_
defining a schema. In fact I wouldn't mind loading the data so long as it
doesn't persist for too long. If only there was a schemaless database that
allowed you to set a time to live on the JSON objects you load into it... that
would be humongous.

------
tantalor
How is this "without any data loads"? You have to map each file to a database
table and then execute SQL against that database. Isn't that exactly what
"loading into a database" means?

~~~
femami
Loading into a database usually means transforming the data so that it is
stored in the database's internal format - for example, with "LOAD DATA ..."
in MySQL or "COPY ... FROM ..." in PostgreSQL.

Here, you aren't storing the data in the database. You don't have to take
any additional action to keep the file and the DB in sync when you add rows
to the file, so it's not loaded into the database.
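
To make the contrast concrete, a sketch (the paths and names are made up):

    -- Loading: COPY ingests the rows into Postgres' internal storage.
    CREATE TABLE events (event_time timestamptz, name text);
    COPY events FROM '/data/events.csv' WITH (FORMAT csv);

    -- Not loading: with file_fdw, a foreign table just points at the
    -- file, so rows appended to it show up in queries automatically.
    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_server FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE events_live (event_time timestamptz, name text)
        SERVER csv_server OPTIONS (filename '/data/events.csv', format 'csv');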

~~~
cwsteinbach
Another benefit of this approach is that it allows you to save storage space
by eliminating the need to create a copy of the JSON data in the DB's own
internal format.

------
caycep
Actually, I might use this thread to ask a noob question - I've been trying
to fiddle around with some data that is relatively schemaless (i.e. I have a
list of medical research subjects, and each of them has varying numbers of
prior medical conditions, medication allergies, and lists of current and
past medicines).

Naturally, I thought JSON was better suited than relational DBs for this,
but I wasn't sure how best to store the data. The relational data that I do
have is stashed in .csv files that I'm transitioning over to sqlite, so I
was wondering if there was a sqlite-esque DB for JSON - I was assuming
something like Couch or Redis. Or am I missing something, and JSON itself IS
the database?

~~~
femami
I think if your data is append-only, this should work well enough. You
implement your data-appending logic, and you use json_fdw for querying your
data. But this requires running a PostgreSQL service; I'm not sure whether
that's acceptable in your application.

------
fmariluis
You can also use Ojota (a flat file database with ORM -
<https://bitbucket.org/msa_team/ojota/overview>) for something like this.

------
spullara
You can use YQL to do this - even join with other data sources.

<http://developer.yahoo.com/yql/console/>

------
rjurney
Apache Pig and Hive are other systems that can do this, minus the one file
limitation.

~~~
femami
json_fdw is to CitusDB (or any other PostgreSQL-based distributed database)
what the JSON SerDe is to Hive.

So I think you should probably compare json_fdw to the JSON SerDe, and
CitusDB to Hive.

------
kashnikov
I'd rather see the code they used to crawl Amazon reviews...

~~~
femami
Is this useful?
<http://www.esuli.it/software/amazon-reviews-downloader-and-parser/>

------
binarysolo
Commenting to save thread -- looks great!

------
dksidana
Interesting

