Both of these demos are served through Cloudflare - since the URL to each database incorporates a sha256 hash of the database contents, I can send a far-future cache expiry header on everything.
> since the URL to each database incorporates a sha256 hash of the database contents
Small nitpick: It incorporates a SHA-256 prefix, i.e. a greatly reduced SHA-256 hash, so the cryptographic integrity guarantees no longer hold.
> send a far-future cache expiry header on everything
If you really want to do that, you should include the full SHA-256 hash in the URL. If you want to shorten the URL, use a more compact encoding than hex (e.g. Base64 or Ascii85).
Alternatively, make it optional (but I admit this probably defeats the purpose): Short URLs without hash and short expiration, long URLs with hash and eternal expiration.
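To make the size trade-off concrete, here is a small Python comparison of hex versus URL-safe Base64 for a full SHA-256 digest (illustrative, not from the original discussion):

    import base64
    import hashlib

    digest = hashlib.sha256(b"example database contents").digest()

    hex_form = digest.hex()                                   # 64 characters
    b64_form = base64.urlsafe_b64encode(digest).rstrip(b"=")  # 43 characters
    print(len(hex_form), len(b64_form))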
I don’t need cryptographic guarantees because I’m not dealing with untrusted data: if you want to screw up Datasette’s caching model by deliberately creating two different databases with the same sha256 prefix then go ahead - you’ll only be hurting your own deployment. As such I figured that if a SHA prefix is good enough for common usage of git, it’s good enough for me.
I use PostgREST for several projects. I have an issue with it crashing occasionally (once every few months) for no obvious reason (to me). I solved it by continuously checking its status and automatically restarting it if it's down. Otherwise great.
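A minimal sketch of what such a watchdog could look like (my own illustration, not the commenter's setup; it assumes PostgREST listens on port 3000 and runs as a systemd unit named "postgrest"):

    import subprocess
    import time
    import urllib.request

    POSTGREST_URL = "http://127.0.0.1:3000/"  # assumed local PostgREST port
    CHECK_INTERVAL = 60                       # seconds between health checks

    def is_up(url):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status < 500
        except Exception:
            return False

    while True:
        if not is_up(POSTGREST_URL):
            # assumes PostgREST runs as a systemd unit named "postgrest"
            subprocess.run(["systemctl", "restart", "postgrest"], check=False)
        time.sleep(CHECK_INTERVAL)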
We use a separate schema for the API (that way we can easily version our APIs as well). The API schemas only contain views/functions that expose the tables/fields/etc. that we want to expose.
I love how opinionated Click is, especially around how CLI tools should be documented. I feel like I built a better tool because Click guided me towards CLI design best practices.
Nice! Congrats on shipping. Any particular reason it's tied so closely to SQLite rather than SQLAlchemy, which could potentially enable a range of backends? I guess to ensure immutability?
The "REST API for your database" thing isn't actually the part of this I'm most excited about - there are quite a few entries in that category already, such as https://github.com/jeffknupp/sandman2
The thing that excites me is being able to take almost any dataset, package it up into a simple, well-understood format and then deploy that in a read-only, highly scalable way, taking advantage of modern containerization systems.
Read-only SQLite is /perfect/ for this use-case. It's fast, widely supported, has a great feature-set and can be distributed as a single file.
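To make the read-only point concrete, SQLite supports URI open flags that declare a file read-only or even immutable. A small Python sketch (placeholder filename, and not necessarily how Datasette itself opens files):

    import sqlite3

    # "immutable=1" promises SQLite the file will never change, so it can skip
    # locking entirely; "mode=ro" alone would give plain read-only access.
    conn = sqlite3.connect("file:fivethirtyeight.db?immutable=1", uri=True)
    print(conn.execute("SELECT count(*) FROM sqlite_master").fetchone())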
My favourite feature of Datasette isn't the web UI, it's the publish command:
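Roughly like this (illustrative; the available publish targets have changed over time, and the database name is just an example):

    datasette publish now fivethirtyeight.db

That packages the database together with Datasette and deploys it to a hosting provider in one step.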
CSV is a pretty terrible format for storing and transmitting data: it only supports string types, the escaping rules are poorly understood, and it can only model a single flat table.
CSV files have a single order and no support for indexes. Even if you sort your CSV on one or more interesting columns and use binary search to locate an interesting range, you can still only do this along a single dimension; any remaining queries must scan the entire file. You'd also have to write all the code for this yourself. SQLite is a much better solution.
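As a concrete sketch of that difference (made-up file and column names), converting a CSV into SQLite lets you index several columns independently, so a range query on any of them avoids a full scan:

    import csv
    import sqlite3

    conn = sqlite3.connect("readings.db")
    conn.execute("CREATE TABLE readings (city TEXT, day TEXT, temp REAL)")

    with open("readings.csv", newline="") as f:
        rows = ((r["city"], r["day"], float(r["temp"])) for r in csv.DictReader(f))
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

    # Two independent indexes - a sorted CSV can only give you one of these.
    conn.execute("CREATE INDEX idx_readings_day ON readings(day)")
    conn.execute("CREATE INDEX idx_readings_temp ON readings(temp)")
    conn.commit()

    hot = conn.execute(
        "SELECT city, day, temp FROM readings WHERE temp > ?", (35.0,)
    ).fetchall()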
XMYSQL: One command to generate REST APIs for any MySQL database.
> 20+ releases to npm in last 2 weeks.
> Have added 'a ton' of features since last shown on HN.
This was trademarked by Audio Devices in Stamford, CT, in 1971, which makes sense since the Commodore Datasette likely didn't exist until the Commodore PET was created in 1977. Commodore seemed to use both 'datasette' and 'datassette'. Perhaps Commodore registered the double-S spelling. Perhaps they were challenged by Audio Devices for using the 'datasette' spelling.
There's a hard limit of 1,000 rows allowed to be returned at any one time. This is because the overhead of serializing more JSON than that could potentially lock up the server. If you try to pull back more than that you get a warning that the data has been truncated, e.g. https://fivethirtyeight.datasettes.com/fivethirtyeight-2628d... - in the JSON that shows up as "truncated": true
If that happens to you using raw SQL, it's up to you to implement offset/limit pagination.
For iterating through a raw table, a more efficient mechanism is used: cursor-based pagination. The table is ordered by its primary key (or its rowid if no primary key was specified) and the "next" links say "get the next 100 items where the primary key is greater than this value" - e.g. https://parlgov.datasettes.com/parlgov-25f9855/cabinet_party...
This means even a table with millions of records can be efficiently paginated through - unlike offset/limit which tends to get slow after a few hundred thousand records.
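In code, keyset pagination looks roughly like this - a sketch against a made-up items(id, name) table, not Datasette's actual implementation:

    import sqlite3

    conn = sqlite3.connect("file:example.db?immutable=1", uri=True)

    def fetch_page(last_seen_id=0, page_size=100):
        # "WHERE id > ?" with "ORDER BY id" walks the primary-key index directly,
        # so the ten-thousandth page costs the same as the first - unlike OFFSET,
        # which has to skip over every earlier row.
        return conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_seen_id, page_size),
        ).fetchall()

    page = fetch_page()
    while page:
        last_id = page[-1][0]   # this value becomes the "next" cursor
        page = fetch_page(last_id)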
I'd also like to see the GraphQL version of this, mainly because of GraphQL's strongly typed schema definitions. Means less work required to validate inputs.
Congrats on the launch! I was searching for exactly such a tool just yesterday :)
I understand the benefits of the immutable pattern, but it would be nice if Datasette could provide an interface for a single, infrequent writer (for example, most FRED data is updated seasonally, though some of it updates daily) so that a pipeline pulling data from some source could be automated.
The immutable thing is pretty key to the whole project - if you do need a writable API there are a few options out there already though. https://github.com/coleifer/sqlite-web looks particularly good.
Cool little utility. Like the fact you can pass SQL in the query string. Have to look for a use for this. Maybe internal use apps that update SQLite periodically from a larger database and just need a view of data. Thanks for sharing.
I have a SQLite DB with historical Visa data, and I want to expose it as an API (like your software does), but with an API key. Not a Python user here - can you tell me about the following?
1. Is it possible to put some security behind the API? So only queries with a key can access it?
2. How would I host this API? Where would the SQLite DB reside?
3. I understand the API is read-only, which is great and works for my use case! Can I also do read-writes to the database from a separate application that will keep updating the database on a nightly basis?
I've not considered authentication at all. If you do need that, my preferred option would be to put it behind a proxy such as nginx that can be configured to require authentication.
You would run this on the same VM or other compute as your SQLite database. You should still be able to perform scheduled updates (cron?) to the underlying SQLite database.
The database is opened immutably, so my guess is you'd either make a copy and update it, or regenerate it from another source (a la Simon's CSVs-to-SQLite tool). You could have a symlink that points to the current DB.
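One way to do the nightly update under that constraint - purely a sketch, with placeholder paths - is to rebuild into a temporary file and atomically swap it into place:

    import os
    import sqlite3

    def rebuild(path="visas.db"):
        tmp = path + ".tmp"
        conn = sqlite3.connect(tmp)
        # ... recreate tables and load the nightly data here ...
        conn.close()
        os.replace(tmp, path)  # atomic rename on POSIX filesystems

Since the file is opened immutably, you would likely need to restart or re-point the server afterwards so it picks up the fresh data.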
This looks great except for the fact that it runs through Cloudflare - that's a big negative for me and will restrict my ability to deploy this a little. However, I can still see some great applications of this that I can't wait to try out.
Cloudflare is absolutely not a requirement - https://polar-bears.now.sh/ for example is running directly on now.sh, and you can deploy a datasette anywhere that can run a Docker container (or just use plain Python 3).
Because the data is meant to be immutable, Datasette works extremely well with any caching proxy. Cloudflare just happens to have the most generous free tier, but self-hosted Varnish or Fastly would work just as well as Cloudflare, if not better.
https://fivethirtyeight.datasettes.com/ contains SQLite versions of every CSV published by the FiveThirtyEight crew to https://github.com/fivethirtyeight/data
I built a separate tool on top of pandas which can mass-convert CSVs to SQLite - https://github.com/simonw/csvs-to-sqlite - so building the database for https://fivethirtyeight.datasettes.com/ is now a one-liner:
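Roughly along these lines (illustrative paths, assuming a local checkout of the fivethirtyeight/data repo):

    csvs-to-sqlite ~/fivethirtyeight-data/*/*.csv fivethirtyeight.db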
https://parlgov.datasettes.com/ uses the SQLite database from http://www.parlgov.org/ - this is a more interesting demo as it includes things like SQL views - e.g. https://parlgov.datasettes.com/parlgov/view_election