Would you like to speak to the rationale for incorporating a sort-of ORM into Datasette? According to the docs, I can iterate over a subset of rows by writing
for row in db["dogs"].rows_where("age > 1", select="name, age", order_by="age desc"): ...
but how is that better than just writing the equivalent SQL statement,
sql = "select name, age from dogs where age > 1 order by age desc;"
for row in ...:
Most of the arguments in the pythonized form are, after all, just SQL fragments, so it's not like you get column name parametrization (you'd still have to properly escape and concatenate the `select` argument).
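For comparison, a minimal sketch of the raw-SQL route using sqlite-utils' own `db.query()` (the file name here is made up); only the value gets bound as a parameter, while the column list is still an ordinary SQL fragment you assemble yourself:

```python
import sqlite_utils

db = sqlite_utils.Database("dogs.db")  # hypothetical database file

columns = "name, age"  # still a SQL fragment, not a bound parameter
sql = f"select {columns} from dogs where age > :min_age order by age desc"
for row in db.query(sql, {"min_age": 1}):  # values, by contrast, can be bound
    print(row["name"], row["age"])
```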
Perusing the docs I found that considerable effort is spent on explaining this and other ORM-like features, instead of just saying "use the SQL you already know" (people without this knowledge won't easily be able to use this API productively anyway).
sqlite-utils isn't (yet) a dependency of Datasette.
The goal with sqlite-utils isn't to build an ORM - that .rows_where() method is definitely the most ORM-like piece of it, and it's purely there as a SQL-builder - it was a natural extension of the .rows property, which grew extra features (like order_by) over time as they were requested by users, eg https://github.com/simonw/sqlite-utils/issues/76
The vast majority of the library is aimed at making getting data IN to SQLite as easy as possible. Datasette is mainly about executing SELECT queries, and I found myself writing a lot of code to populate those database files - which grew into a combination library and CLI tool.
So yeah - I don't use the rows_where() feature much in my own code - I tend to use db.query() directly - but I've evolved the method over time based on user feedback. I don't have particularly strong opinions about it one way or the other.
Ah OK that's interesting. I would've surmised that the `.rows_where()` API grew out of your own need to repeatedly stitch together pieces of SQL for dynamic queries, but apparently it didn't go that way.
Side note: SQLite mem-only DBs are amazing for working with a dataset you want to load and then get counts, groups, and other fun things out of. Loops and other stuff in your code get replaced by (cleaner?) SQL.
It's also way easier to see what you were doing 6mo later and the performance is not terrible.
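A quick sketch of that pattern with Python's sqlite3 (the table and the records here are made up):

```python
import sqlite3

# Hypothetical records already loaded from a file or API
rows = [
    {"city": "Oslo", "amount": 12.5},
    {"city": "Oslo", "amount": 3.0},
    {"city": "Bergen", "amount": 7.25},
]

db = sqlite3.connect(":memory:")
db.execute("create table sales (city text, amount real)")
db.executemany("insert into sales (city, amount) values (:city, :amount)", rows)

# The loop-and-accumulate code becomes a single GROUP BY
for city, n, total in db.execute(
    "select city, count(*), sum(amount) from sales group by city"
):
    print(city, n, total)
```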
We use SQLite in-memory databases for executing 100% of our business logic these days. Letting the business write all the rules in SQL is the biggest win of my career so far.
Also, if you think SQLite might be too constrained for your business case, you can expose any arbitrary application function to it. E.g.:
The very first thing we did was pipe DateTime into SQLite as a UDF. Imagine instantly having the full power of .NET6 available from inside SQLite.
Note that these functions do NOT necessarily have to avoid side effects either. You can use a procedural DSL via SELECT statements that invokes any arbitrary business method with whatever parameters from the domain data.
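That example is .NET, but the same trick in Python (keeping one language across these sketches) is a couple of calls to sqlite3's `create_function`; the `now_iso` and `audit_log` names below are made up for illustration:

```python
import sqlite3
from datetime import datetime

db = sqlite3.connect(":memory:")

# Expose an application function to SQL as a scalar UDF
db.create_function("now_iso", 0, lambda: datetime.now().isoformat())

# UDFs are not required to be pure - this one has a side effect (illustrative only)
audit = []
def audit_log(message):
    audit.append(message)  # stand-in for an arbitrary business method call
    return None
db.create_function("audit_log", 1, audit_log)

db.execute("create table orders (id integer, note text)")
db.execute("insert into orders values (1, 'demo')")

# Business rules written as SELECT statements can call back into the application
print(db.execute("select id, now_iso(), audit_log(note) from orders").fetchall())
print(audit)
```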
The process is so simple I am actually disappointed that we didn't think of it sooner. You just put a template database in memory w/ the schema pre-loaded, then make a copy of this each time you want to map domain state for SQL execution.
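A minimal sketch of the template-and-copy step, assuming Python's sqlite3 and a made-up schema; `Connection.backup()` clones the pre-loaded template into a fresh in-memory connection each time:

```python
import sqlite3

# Template database: schema (and optionally views) loaded once, kept in memory
template = sqlite3.connect(":memory:")
template.executescript("""
    create table facts (entity text, attribute text, value text);
    -- a view lets the business work at a higher level of abstraction
    create view customer_facts as
        select entity, attribute, value from facts
        where attribute like 'customer.%';
""")

def fresh_database():
    """Copy the template into a new in-memory connection for one evaluation."""
    work = sqlite3.connect(":memory:")
    template.backup(work)
    return work

db = fresh_database()
db.executemany(
    "insert into facts values (?, ?, ?)",
    [("order-1", "customer.tier", "gold")],
)
print(db.execute("select * from customer_facts").fetchall())
```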
You can do conditionals, strings, arrays of strings, arrays of CSVs, etc. Any shape of thing you need to figure out a conditional or dynamic presentation of business facts.
Oh and you can also use views to build arbitrary layers of abstraction so the business can focus on their relevant pieces.
Queries are managed via a web interface. Nothing is known at compile time.
> What does the SQLite SQL dialect give you that LINQ/functions do not?
It's not about SQLite's specific dialect. It's just about SQL. The relational algebra/calculi are capable of expressing any degree of complexity. LINQ (functions) require compile-time, which breaks our objectives.
I agree, I was once on the implementation side of this. It was lovely. All day writing pure, terse, bug-free logic. Until that utopia started to itch...
But the experience helped me to look for the SQL patterns in the logic of the codebase I am working with.
Often there is very little, which mostly means the rest is just an annoying heap of plumbing. It's not like I can magically make it go away, but it still seems unnecessary to me.
My rule of thumb at the moment is that for anything up to 10GB of data SQLite basically Just Works. For 10-100GB it will work if you design your indexes carefully. Above 100GB gets harder - my hunch is that there are various tricks and optimizations I've not yet discovered to help make that work OK, but switching to PostgreSQL is probably easier for that kind of scale.
I've got SQLite databases in production that are well beyond 100 gigabytes. These size limits are completely meaningless without any background on actual usage patterns & business cases.
There is no arbitrary point in database size (prior to the exact stated maximums [1]) at which SQLite just magically starts sucking ass for no good reason.
In the (very common) case of a single node, single tenant database server, you will never be able to extract more throughput from that box with a hosted solution over a well-tuned SQLite solution running inside the application binary. It is simply impossible to overcome the latency & other overhead imposed by all hosted SQL solutions. SQLite operations are effectively a direct method invocation. Microseconds, if that. Anything touching the network stack will start you off with 10-1000x more latency.
Unless you can prove you will ever need more capabilities than a single server can offer, SQLite is clearly the best engineering choice.
My rules of thumb are based entirely off experiments I've done with Datasette, which tends towards ad-hoc querying, often without the best indexes and with a LOT of group-by/count queries to implement faceting.
You've made me realize that those rules of thumb (which are pretty unscientific already) likely don't apply at all to projects outside of Datasette, so I should probably keep them to myself!
If pandas gives you a full relational API with arbitrary data then in isolation it doesn’t make much of a difference. SQL is more portable so to speak, but a library like that may introduce less friction. Pragmatism is advised here.
The big idea here is to use relational logic programming to express data transformation outside of storage access. The paper "Out of the Tar Pit" proposed this as a way to reduce accidental complexity.
SQLite probably has much wider (and more stable) support across other languages than pandas.
Also, it's a different thing: pandas is very nice for doing data analytics or crunching numbers on recurring data, but I wouldn't replace a database with it.
I recently started using pandas to do groupby and aggregation. It’s nice to have, but the whole time I kept wishing it would just let me run an SQL query without adding another dependency. Having learned SQL long ago, I find it to be much more intuitive and expressive. I guess it’s all just what you know.
Pandas does have read/write support for SQLite. You can dump a dataframe to SQLite, transform it, then load it back from SQL again. Whether it's worth it depends on your case, I guess.
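Roughly, that round trip looks like this (the table and column names are made up):

```python
import sqlite3
import pandas as pd

# Made-up example dataframe
df = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen"], "amount": [12.5, 3.0, 7.25]})

conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)  # dump the dataframe into SQLite

# ...transform with plain SQL, then come back to a dataframe
totals = pd.read_sql_query(
    "select city, sum(amount) as total from sales group by city", conn
)
print(totals)
```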
I recently started looking into the most mature SQL database for the browser. My goal was similar: to be able to take data as JSON or HTTP responses or CSV or whatever and pass it through a SQL engine, with the user providing custom code/data the whole way.
I do not know sqlite-utils, but my first thought was: why not do it purely using jq and sqlite3's CLI? Okay, this is interactive, but so are almost all of my exploratory data analysis tasks.
Just a few steps: convert the JSON to CSV with jq, fire up the interactive sqlite3 CLI (which connects to an in-memory database by default), run two .import FILE TABLE statements, and finally run the SQL query.
Alternatively, one could use the SQLite JSON1 extension and write a small script if the task should be automated (and you do not like jq's syntax for JSON->CSV conversion).
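For the "small script" route, a Python sketch that skips jq entirely; it assumes a made-up data.json holding an array of objects and a SQLite build with JSON1 compiled in (which most are these days):

```python
import json
import sqlite3

# Hypothetical input: data.json holds a JSON array of objects
records = json.load(open("data.json"))

db = sqlite3.connect(":memory:")
db.execute("create table items (name text, value real)")
db.executemany(
    "insert into items values (:name, :value)",
    [{"name": r.get("name"), "value": r.get("value")} for r in records],
)
print(db.execute("select name, value from items order by value desc").fetchall())

# Or skip the staging table and let JSON1's json_each() explode the raw JSON
rows = db.execute(
    "select json_extract(j.value, '$.name'), json_extract(j.value, '$.value') "
    "from json_each(?) as j",
    (json.dumps(records),),
).fetchall()
print(rows)
```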
Because that's more steps. Everything you can do with the new "sqlite-utils memory" command is possible without it, but now you can get a ton of stuff done with a single tool in a one-line shell script (as opposed to several lines that tie together several different tools).
I've been going at this kind of task for quite some time now, most frequently with pandas as the "tee"-style tool to pull different kinds of data sources together. This looks neat as well, though, and probably more convenient in a shell-style & scripting environment. Might be useful e.g. when doing things in cloud-init or sth - a versatile tool to stick into VM base images.