
Show HN: Querying JSON documents using SQL-like language in Scala - mmalek06
https://github.com/mmalek06/JsonSql
======
cube2222
As people always do in such threads, I’d too like to chime in with
alternatives I’ve been a co-author of, as this is an area I’m very interested
in.

The first one, which is definitely close in spirit to this project, is
OctoSQL[1]. We too are trying to query all the things with SQL, including
JSON, though we extended the idea to support more data sources and file
formats, with the ability to mix and match in joins and subqueries. Curious
what direction this project will take, though, as we're now heading for
streaming sources (like Kafka) in pure SQL!

A totally different one is jql[2] where I’ve been exploring more of a lispy
continuation passing style based approach to a JSON query language, as I
didn’t really feel like the current ones are ergonomic.

If you like this, make sure to check them out too! (And all the others people
will be posting, as I think they all are fascinating! This is a big area with
lots of space for innovation left.)

[1]: [https://github.com/cube2222/octosql](https://github.com/cube2222/octosql)

[2]: [https://github.com/cube2222/jql](https://github.com/cube2222/jql)

~~~
mmalek06
OctoSQL looks very nice and I’ll check it out soon. Do you have Scala bindings
for it?

~~~
mmalek06
Ah, I get it - it’s an uber ORM!

~~~
cube2222
Nah, really just a command-line SQL query tool. Though it could probably be
used as an uber ORM.

------
simonw
I've been experimenting with the combo of the "jq" JSON query language and SQL
via a custom SQLite function. I bundled the result as a plugin for my
Datasette project: [https://github.com/simonw/datasette-jq](https://github.com/simonw/datasette-jq)

Here's an example query: [https://datasette-jq-demo.datasette.io/demo?sql=select%0D%0A...](https://datasette-jq-demo.datasette.io/demo?sql=select%0D%0A++package%2C%0D%0A++jq%28%0D%0A++++info%2C%0D%0A++++%22%7Bsummary%3A+.info.summary%2C+author%3A+.info.author%2C+versions%3A+.releases%7Ckeys%7Creverse%7D%22%0D%0A++%29%2C%0D%0A++info%0D%0Afrom%0D%0A++packages)

------
dig1
Apache Drill [1] is an alternative, using ANSI SQL. Besides JSON, it supports
JDBC, CSV, Parquet, file system traversal, HBase, Mongo, S3... I believe it
also supports joins across different sources, although I haven't tried that.

EDIT: to add context to the story - Apache Drill runs on the JVM and can be
embedded, so it can be used from Scala code as well.

[1] [https://drill.apache.org/docs/querying-json-files/](https://drill.apache.org/docs/querying-json-files/)

~~~
mmalek06
That looks nice. Can it be used in a program? I mean, not from cli, but inside
code?

------
mmalek06
I recently published a GitHub project with the purpose mentioned in the post's
title. Obviously, if you just want to parse JSON, there are much better Scala
libraries, like Circe. However, in my case I had to filter responses from a
very dynamic API that I couldn't change in any way (for example, to give me
prefiltered results). My first choice was to use the JsonXPath Java lib, but I
didn't like that XPath-like notation, so I figured that some people might
benefit from using what most of us already know - SQL - to create their
queries.

In my project it was supposed to be configurable - the architect wanted to
enable the not-so-technical business bunch to query JSON documents almost
without the programmers' help, and the easiest way to do that was to put their
queries into the database and then allow them to match a query against an API.
I left the project some time ago, but the idea I had back then kept me awake
at night, and here it is. Hopefully it's not entirely bad ;) If you like it, I
would really appreciate some help with developing the rest of some even cooler
functionalities.
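
To make the configuration-driven idea concrete, here is a tiny hypothetical
sketch (not the actual JsonSql parser; `Cond`, `parse`, and `matches` are
names invented for illustration) of a filter string stored outside the code
and applied to decoded JSON rows:

```scala
// Hypothetical sketch: a stored filter string applied to decoded JSON rows.
// Cond/parse/matches are invented for illustration; JsonSql's real parser differs.
final case class Cond(field: String, op: String, value: Int)

// Parse a trivial "field op number" query, e.g. "age >= 21".
def parse(q: String): Cond = q.trim.split("\\s+") match {
  case Array(f, op, v) => Cond(f, op, v.toInt)
  case _               => sys.error(s"unsupported query: $q")
}

def matches(row: Map[String, Any], c: Cond): Boolean = {
  val x = row(c.field).toString.toInt
  c.op match {
    case "="  => x == c.value
    case "<"  => x < c.value
    case "<=" => x <= c.value
    case ">"  => x > c.value
    case ">=" => x >= c.value
    case op   => sys.error(s"unsupported operator: $op")
  }
}

// The query lives in configuration (or a database row), not in code.
val stored = "age >= 21"
val rows = List(
  Map("name" -> "Ann", "age" -> 34),
  Map("name" -> "Bob", "age" -> 19)
)
val filtered = rows.filter(matches(_, parse(stored)))
// filtered keeps only Ann's row
```

The non-technical users only ever touch the stored string; the code that loads
and applies it never changes.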

------
inshadows
IMHO nothing beats jq[1]. There's Pyjq[2] for Python.

[1] [https://stedolan.github.io/jq/](https://stedolan.github.io/jq/) [2]
[https://github.com/doloopwhile/pyjq](https://github.com/doloopwhile/pyjq)

~~~
bachmeier
I personally find jq to be extremely complicated once you move beyond simple
queries.

~~~
benibela
You could try my Xidel, based on XPath and JSONiq (
[http://www.videlibri.de/xidel.html](http://www.videlibri.de/xidel.html) ).

It is more verbose for simple queries, but with named variables and functions,
larger queries do not become complicated.

~~~
inshadows
Xidel is my immediate pick for querying HTML. I use it in several scripts.
Thanks!

I still prefer jq for JSON processing though. I'm hooked on pipelines. EDIT:
The great power of jq is in transforming documents, not querying.

------
mmalek06
I figured I'd give an update to anyone who's still interested. I managed to
fix a few issues and add some major functionalities (e.g. the * symbol to
select all) over the last few days. The next thing I'll work on will be typed
queries, so that the user will be able to use queries as strings but also as a
chain of method calls with a type parameter representing the model returned
from the query. Something like: from(jsonObject).where(list of filter
functions).select[CaseClassType]() . Stay tuned!
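
A minimal toy sketch of what such a method-based chain could look like (my own
version over `Map[String, Any]` rows, not the planned JsonSql API; `select`
here takes an explicit projection function, since building a case class from
the type parameter alone would need macro or derivation machinery):

```scala
// Toy sketch of a from(...).where(...).select[...] chain over decoded JSON rows.
final case class Query(rows: List[Map[String, Any]]) {
  // Keep only rows satisfying every supplied predicate.
  def where(preds: (Map[String, Any] => Boolean)*): Query =
    Query(rows.filter(r => preds.forall(p => p(r))))

  // Project each remaining row onto a result model A.
  def select[A](project: Map[String, Any] => A): List[A] =
    rows.map(project)
}

def from(rows: List[Map[String, Any]]): Query = Query(rows)

final case class User(name: String, age: Int)

val docs = List(
  Map("name" -> "Ann", "age" -> 34),
  Map("name" -> "Bob", "age" -> 19)
)

val adults: List[User] = from(docs)
  .where(r => r("age").asInstanceOf[Int] >= 21)
  .select[User](r => User(r("name").toString, r("age").asInstanceOf[Int]))
```

Unlike the string form, the compiler checks the result type of the whole
chain, which is the type-safety benefit discussed elsewhere in this thread.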

------
collyw
SQL isn't the good part about relational databases; it's the relational model
that is the good part - the ability to query your data from multiple different
angles.

In saying that, I guess it's still better than having yet another custom query
language a la Mongo.

~~~
mmalek06
I don't agree :). SQL nicely abstracts away what would otherwise be more
difficult or technical to do. Yep, it was created to enable not strictly
technical people to query their data, but does that make it bad? I don't
think so. The authors of a great many NoSQL databases (and other technologies
- ever heard of the widely loved LINQ in C#?) seem to think the same, giving
their users the possibility to query data using what they already know. Also:
can you honestly, from the bottom of your heart, say that you'd rather filter
JSON documents using some form of XPath? Besides, I'm not trying to criticize
the relational model, and I'm not saying that JSON is the way to go when
storing data :D I just wanted to create a tool for people who are trying to
implement the very specific business case I described and, due to a lack of
tools, are having some issues with it.

~~~
collyw
> Also: can you honestly, from the bottom of your heart, say that you'd rather
> filter json documents using some form of XPath?

Come on, read the second line of my reply before strawmanning.

~~~
mmalek06
Ok, maybe I got too defensive and assigned too much weight to the first part
of your first sentence. No strawmanning intended :)

------
danellis
This stringified approach loses type safety and IDE assistance for what
benefit? Why do we _want_ it to look like SQL? A Slick/Quill-like approach
seems much safer.

~~~
mmalek06
Please read my description (found elsewhere in this discussion). That being
said: there are scenarios where being able to save a query into a database or
a config file, then load and run it, is useful. In general I agree though, and
that's why one of the enhancements I want to implement is a method-based
syntax with projection of the selected fields onto a case class.

~~~
mmalek06
As for the „why do we want it to look like sql” - I like the syntax and I
think it’s much more readable than any other language used to process data.
But that’s just a matter of taste :)

------
anon176
Pretty cool. You can do this in apache spark already but this looks a lot
quicker to get going with.

~~~
mmalek06
Didn’t know about it. Is it possible to use only that subset of apache spark
lib?

~~~
anon176
Yes, you can run Apache Spark as a single node really easily. Then once it's
running you can fire up the Scala or Python shell that it comes with. After
that, it is just a matter of issuing the statements to set up the data set and
then issuing queries against it.
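
For reference, a minimal sketch of that flow in Scala (assumes the Spark SQL
dependency is on the classpath; the JSON lives in memory here for
self-containedness, though spark.read.json also takes file paths):

```scala
import org.apache.spark.sql.SparkSession

// Single-node ("local") Spark session querying JSON with plain SQL.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-sql-demo")
  .getOrCreate()
import spark.implicits._

// One JSON document per line, as spark.read.json expects.
val raw = Seq(
  """{"name":"Ann","age":34}""",
  """{"name":"Bob","age":19}"""
).toDS()

spark.read.json(raw).createOrReplaceTempView("people")

val adults = spark.sql("SELECT name FROM people WHERE age >= 21").collect()
adults.foreach(println)

spark.stop()
```

The same statements work interactively in the spark-shell that ships with the
distribution (minus the SparkSession boilerplate, which the shell provides).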

~~~
khc
Not many people know this but Databricks offers
[https://community.cloud.databricks.com/](https://community.cloud.databricks.com/)
for free which allows you to run simple spark notebooks.

Disclosure: I work for Databricks, but not on Spark.

