Would you like to speak to the rationale for incorporating a sort-of ORM into Datasette? According to the docs, I can iterate over a subset of rows by writing
for row in db["dogs"].rows_where("age > 1", select="name, age", order_by="age desc"): ...
but how is that better than just writing the equivalent SQL statement,
sql = "select name, age from dogs where age > 1 order by age desc;"
for row in ...:
Most of the arguments in the pythonized form are, after all, just SQL fragments, so it's not like you get column name parametrization (you'd still have to properly escape and concatenate the `select` argument).
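For comparison, a minimal sketch of the raw-SQL route using sqlite-utils' own `db.query()` (the file name here is made up); only the value gets bound as a parameter, while the column list is still an ordinary SQL fragment you assemble yourself:

```python
import sqlite_utils

db = sqlite_utils.Database("dogs.db")  # hypothetical database file

columns = "name, age"  # still a SQL fragment, not a bound parameter
sql = f"select {columns} from dogs where age > :min_age order by age desc"
for row in db.query(sql, {"min_age": 1}):  # values, by contrast, can be bound
    print(row["name"], row["age"])
```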
Perusing the docs I found that considerable effort is spent on explaining this and other ORM-like features, instead of just saying "use the SQL you already know" (people without this knowledge won't easily be able to use this API productively anyway).
sqlite-utils isn't (yet) a dependency of Datasette.
The goal with sqlite-utils isn't to build an ORM - that .rows_where() method is definitely the most ORM-like piece of it, and it's purely there as a SQL-builder - it was a natural extension of the .rows property, which grew extra features (like order_by) over time as they were requested by users, eg https://github.com/simonw/sqlite-utils/issues/76
The vast majority of the library is aimed at making getting data IN to SQLite as easy as possible. Datasette is mainly about executing SELECT queries, and I found myself writing a lot of code to populate those database files - which grew into a combination library and CLI tool.
So yeah - I don't use the rows_where() feature much in my own code - I tend to use db.query() directly - but I've evolved the method over time based on user feedback. I don't have particularly strong opinions about it one way or the other.
Ah OK that's interesting. I would've surmised that the `.rows_where()` API grew out of your own need to repeatedly stitch together pieces of SQL for dynamic queries, but apparently it didn't go that way.
Side note: SQLite mem-only DBs are amazing for working with a dataset you want to load and then get counts, groups, and other fun things out of. Loops and other stuff in your code get replaced by (cleaner?) SQL.
It's also way easier to see what you were doing 6mo later and the performance is not terrible.
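A quick sketch of that pattern with Python's sqlite3 (the table and the records here are made up):

```python
import sqlite3

# Hypothetical records already loaded from a file or API
rows = [
    {"city": "Oslo", "amount": 12.5},
    {"city": "Oslo", "amount": 3.0},
    {"city": "Bergen", "amount": 7.25},
]

db = sqlite3.connect(":memory:")
db.execute("create table sales (city text, amount real)")
db.executemany("insert into sales (city, amount) values (:city, :amount)", rows)

# The loop-and-accumulate code becomes a single GROUP BY
for city, n, total in db.execute(
    "select city, count(*), sum(amount) from sales group by city"
):
    print(city, n, total)
```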
We use SQLite in-memory databases for executing 100% of our business logic these days. Letting the business write all the rules in SQL is the biggest win of my career so far.
Also, if you think SQLite might be too constrained for your business case, you can expose any arbitrary application function to it. E.g.:
The very first thing we did was pipe DateTime into SQLite as a UDF. Imagine instantly having the full power of .NET6 available from inside SQLite.
Note that these functions do NOT necessarily have to avoid side effects either. You can use a procedural DSL via SELECT statements that invokes any arbitrary business method with whatever parameters from the domain data.
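That example is .NET, but the same trick in Python (keeping one language across these sketches) is a couple of calls to sqlite3's `create_function`; the `now_iso` and `audit_log` names below are made up for illustration:

```python
import sqlite3
from datetime import datetime

db = sqlite3.connect(":memory:")

# Expose an application function to SQL as a scalar UDF
db.create_function("now_iso", 0, lambda: datetime.now().isoformat())

# UDFs are not required to be pure - this one has a side effect (illustrative only)
audit = []
def audit_log(message):
    audit.append(message)  # stand-in for an arbitrary business method call
    return None
db.create_function("audit_log", 1, audit_log)

db.execute("create table orders (id integer, note text)")
db.execute("insert into orders values (1, 'demo')")

# Business rules written as SELECT statements can call back into the application
print(db.execute("select id, now_iso(), audit_log(note) from orders").fetchall())
print(audit)
```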
The process is so simple I am actually disappointed that we didn't think of it sooner. You just put a template database in memory w/ the schema pre-loaded, then make a copy of this each time you want to map domain state for SQL execution.
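A minimal sketch of the template-and-copy step, assuming Python's sqlite3 and a made-up schema; `Connection.backup()` clones the pre-loaded template into a fresh in-memory connection each time:

```python
import sqlite3

# Template database: schema (and optionally views) loaded once, kept in memory
template = sqlite3.connect(":memory:")
template.executescript("""
    create table facts (entity text, attribute text, value text);
    -- a view lets the business work at a higher level of abstraction
    create view customer_facts as
        select entity, attribute, value from facts
        where attribute like 'customer.%';
""")

def fresh_database():
    """Copy the template into a new in-memory connection for one evaluation."""
    work = sqlite3.connect(":memory:")
    template.backup(work)
    return work

db = fresh_database()
db.executemany(
    "insert into facts values (?, ?, ?)",
    [("order-1", "customer.tier", "gold")],
)
print(db.execute("select * from customer_facts").fetchall())
```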
You can do conditionals, strings, arrays of strings, arrays of CSVs, etc. Any shape of thing you need to figure out a conditional or dynamic presentation of business facts.
Oh and you can also use views to build arbitrary layers of abstraction so the business can focus on their relevant pieces.
Queries are managed via a web interface. Nothing is known at compile time.
> What does the SQLite SQL dialect give you that LINQ/functions do not?
It's not about SQLite's specific dialect. It's just about SQL. The relational algebra/calculi are capable of expressing any degree of complexity. LINQ (functions) require compile-time, which breaks our objectives.
I agree, I was once on the implementation side of this. It was lovely. All day writing pure, terse, bug-free logic. Until that utopia started to itch...
But the experience helped me to look for the SQL patterns in the logic of the codebase I am working with.
Often there is very little, which mostly means the rest is just an annoying heap of plumbing. It's not like I can magically make it go away, but it still seems unnecessary to me.
My rule of thumb at the moment is that for anything up to 10GB of data SQLite basically Just Works. For 10-100GB it will work if you design your indexes carefully. Above 100GB gets harder - my hunch is that there are various tricks and optimizations I've not yet discovered to help make that work OK, but switching to PostgreSQL is probably easier for that kind of scale.
I've got SQLite databases in production that are well beyond 100 gigabytes. These size limits are completely meaningless without any background on actual usage patterns & business cases.
There is no arbitrary point in database size (prior to the exact stated maximums [1]) at which SQLite just magically starts sucking ass for no good reason.
In the (very common) case of a single node, single tenant database server, you will never be able to extract more throughput from that box with a hosted solution over a well-tuned SQLite solution running inside the application binary. It is simply impossible to overcome the latency & other overhead imposed by all hosted SQL solutions. SQLite operations are effectively a direct method invocation. Microseconds, if that. Anything touching the network stack will start you off with 10-1000x more latency.
Unless you can prove you will ever need more capabilities than a single server can offer, SQLite is clearly the best engineering choice.
My rules of thumb are based entirely off experiments I've done with Datasette, which tends towards ad-hoc querying, often without the best indexes and with a LOT of group-by/count queries to implement faceting.
You've made me realize that those rules of thumb (which are pretty unscientific already) likely don't apply at all to projects outside of Datasette, so I should probably keep them to myself!
If pandas gives you a full relational API with arbitrary data then in isolation it doesn’t make much of a difference. SQL is more portable so to speak, but a library like that may introduce less friction. Pragmatism is advised here.
The big idea here is to use relational logic programming to express data transformation outside of storage access. The paper "Out of the Tar Pit" proposed this as a way to reduce accidental complexity.
SQLite probably has much wider (and more stable) support across other languages than pandas.
Also, it's a different thing: pandas is very nice for doing data analytics or crunching numbers on recurring data, but I wouldn't replace a database with it.
I recently started using pandas to do groupby and aggregation. It’s nice to have, but the whole time I kept wishing it would just let me run an SQL query without adding another dependency. Having learned SQL long ago, I find it to be much more intuitive and expressive. I guess it’s all just what you know.
Pandas does have read/write support for SQLite. You can dump a dataframe to SQLite, transform it, then load it back from SQL again. Whether it's worth it depends on your case, I guess.
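Roughly, that round trip looks like this (the table and column names are made up):

```python
import sqlite3
import pandas as pd

# Made-up example dataframe
df = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen"], "amount": [12.5, 3.0, 7.25]})

conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)  # dump the dataframe into SQLite

# ...transform with plain SQL, then come back to a dataframe
totals = pd.read_sql_query(
    "select city, sum(amount) as total from sales group by city", conn
)
print(totals)
```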
I recently started looking into the most mature SQL database for the browser. My goal was similar: to be able to take data as JSON or HTTP responses or CSV or whatever and pass it through a SQL engine, with the user providing custom code/data the whole way.
I do not know sqlite-utils, but my first thought was: why not do it purely using jq and sqlite3's CLI? Okay, this is interactive, but so are almost all of my exploratory data analysis tasks.
Just a few steps: convert the JSON to CSV with jq, fire up the interactive sqlite3 CLI (which connects to an in-memory database by default), run two .import FILE TABLE statements, and finally run the SQL query.
Alternatively, one could use the SQLite JSON1 extension and write a small script if the task should be automated (and you do not like jq's syntax for JSON->CSV conversion).
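For the "small script" route, a Python sketch that skips jq entirely; it assumes a made-up data.json holding an array of objects and a SQLite build with JSON1 compiled in (which most are these days):

```python
import json
import sqlite3

# Hypothetical input: data.json holds a JSON array of objects
records = json.load(open("data.json"))

db = sqlite3.connect(":memory:")
db.execute("create table items (name text, value real)")
db.executemany(
    "insert into items values (:name, :value)",
    [{"name": r.get("name"), "value": r.get("value")} for r in records],
)
print(db.execute("select name, value from items order by value desc").fetchall())

# Or skip the staging table and let JSON1's json_each() explode the raw JSON
rows = db.execute(
    "select json_extract(j.value, '$.name'), json_extract(j.value, '$.value') "
    "from json_each(?) as j",
    (json.dumps(records),),
).fetchall()
print(rows)
```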
Because that's more steps. Everything you can do with the new "sqlite-utils memory" command is possible without it, but now you can get a ton of stuff done with a single tool in a one-line shell script (as opposed to several lines that tie together several different tools).
I've been going at this kind of task for quite some time now, most frequently with pandas as the "tee"-style tool to pull different kinds of data sources together. This looks neat as well, though, and probably more convenient in a shell-style & scripting environment. Might be useful e.g. when doing things in cloud-init or sth - a versatile tool to stick into VM base images.