Hacker News new | comments | show | ask | jobs | submit login
Show HN: Datasette – Create and publish an API for SQLite databases (simonwillison.net)
269 points by simonw 66 days ago | hide | past | web | favorite | 44 comments

A couple of demos:

https://fivethirtyeight.datasettes.com/ contains SQLite versions of every CSV published by the FiveThirtyEight crew to https://github.com/fivethirtyeight/data

I built a separate tool on top of pandas which can mass-convert CSVs to SQLite - https://github.com/simonw/csvs-to-sqlite - so building the database for https://fivethirtyeight.datasettes.com/ is now a one-liner:

    csvs-to-sqlite ~/fivethirtyeight/data fivethirtyeight.db
https://parlgov.datasettes.com/ uses the SQLite database from http://www.parlgov.org/ - this is a more interesting demo as it includes things like SQL views - e.g. https://parlgov.datasettes.com/parlgov/view_election

Both of these demos are served through Cloudflare - since the URL to each database incorporates a sha256 hash of the database contents, I can send a far-future cache expiry header on everything.

> since the URL to each database incorporates a sha256 hash of the database contents

Small nitpick: It incorporates a SHA-256 prefix, i.e. a greatly reduced SHA-256 hash, so the cryptographic integrity guarantees no longer hold.

> send a far-future cache expiry header on everything

If you really want to do that, you should include the full SHA-256 hash in the URL. If you want to shorten the URL, use a more compact encoding (e.g. Base64 or Ascii85, hex).

Alternatively, make it optional (but I admit this probably defeats the purpose): Short URLs without hash and short expiration, long URLs with hash and eternal expiration.

I don’t need cryptographic guarantees because I’m not dealing with untrusted data: if you want to screw up Datasette’s caching model by deliberately creating two different databases with the same sha256 prefix then go ahead - you’ll only be hurting your own deployment. As such I figured that if a SHA prefix is good enough for common usage of git, it’s good enough for me.

See also: PostgREST, which turns a PostgreSQL database into a rest API.


PostgREST looks fantastic. Anyone used it in production?

I use postgrest for several projects. I have an issue with crashing occasionally (once in months) for no obvious reason (to me). I solved it by continously checking the status and automatically restarting it if it's down. Otherwise great.

Yes, with good results. Anything specific you're curious about?

Hi. I'd like to know how you design the database to separate data available via REST and data that is not.

We use a separate schema for the API (that way we can easily version our APIs as well). The API schemas only contain views/function that expose the tables/fields/etc that we want to expose.

SQL View queries get their own REST endpoints or you can use Postgrest like a language agnostic ORM.

Shameless plug: I made something similar a while back - https://github.com/assafmo/SQLiteProxy

Not as fancy as OP's, but runs great in production since last June :-)

Click is nice, but for my money, Fire is what Click should have been. https://github.com/google/python-fire

I love how optinionated Click is, especially around how CLI tools should be documented. I feel like I built a better tool because Click guided me towards best CLI design practices.

The significant downside to click is the slow startup time. Seems to add more than a second to startup in most cases! Still love it though.

shameless plug (for MySql) : https://github.com/o1lab/xmysql

  XMYSQL : One command to generate REST APIs for any MySql 

  > 20+ releases to npm in last 2 weeks. 
  > Have added 'a ton' of features since last shown on HN.

Asynchronous, easily configurable at runtime, and an exemplar of Python code by one of the creators of Django. Quite a treat.

Does Commodore own the trademark for the Datasette name?

This was trademarked by Audio Devices in Stamford, CT, in 1971, which makes sense since the Commodore Datasette likely didn't exist until the Commodore PET was created in 1977. Commodore seemed to use both 'datasette' and 'datassette'. Perhaps Commodore registered the double-S spelling. Perhaps they were challenged by Audio Devices for using the 'datasette' spelling.

well that brought on a surge of nostalgia.

listening to the sound of your programs and data at a few hundred (dozens?) bytes per second :)

Any examples of how it handles pagination? Couldn't find it in the README.md

Good catch, I haven't documented that yet.

There's a hard limit of 1,000 rows allowed to be returned at any one time. This is because the overhead of serializing more JSON that than could potentially lock up the server. If you try to pull more than that back you get a warning that the data has been truncated, e.g. https://fivethirtyeight.datasettes.com/fivethirtyeight-2628d... - in the JSON that shows up as "truncated": true

If that happens to you using raw SQL, it's up to you to implement offset/limit pagination.

If you are looking at a view, the tool implements offset/limit pagination for you e.g. https://parlgov.datasettes.com/parlgov-25f9855/view_election...

For iterating through a raw table, a more efficient mechanism is used: cursor-based pagination. The table is ordered by its primary key (or its rowid if no primary key was specified) and the "next" links say "get the next 100 items where the primary key is greater than this value" - e.g. https://parlgov.datasettes.com/parlgov-25f9855/cabinet_party...

This means even a table with millions of records can be efficiently paginated through - unlike offset/limit which tends to get slow after a few hundred thousand records.

Nice! Congrats on shipping. Any particular reason it's tied closely together with sqlite rather sqlalchemy which could potentially enable a range of backends? I guess to ensure immutability?

The "REST API for your database" thing isn't actually the part of this I'm most excited about - there are quite a few entries in that category already, such as https://github.com/jeffknupp/sandman2

The thing that excites me is being able to take almost any dataset, package it up into a simple, well-understood format and then deploy that in a read-only, highly scalable way, taking advantage of modern containerization systems.

Read-only SQLite is /perfect/ for this use-case. It's fast, widely supported, has a great feature-set and can be distributed as a single file.

My favourite feature of Datasette isn't the web UI, it's the publish command:

    datasette publish now ~/path-to-my/sqlite.db
Combined with my https://github.com/simonw/csvs-to-sqlite tool it means you can take a CSV file and turn it into a live-on-the-internet REST SQL API in literally minutes. Here are a bunch of datasets that I've done that to just today: https://github.com/simonw/datasette/wiki/Datasettes

I'm absolutely, unequivocally not trying to be snarky, but isn't CSV itself perfect for this use-case?

CSV is a pretty terrible format for storing and transmitting data: it only supports string types, the escaping rules are poorly understood, and it can only model a single flat table.

CSV files have a single order and no support for indexes. Even if you sort your CSV on one or more interesting columns and use binary search to locate an interesting range, you can still only do this along a single dimension, any remaining queries must scan the entire file. You'd also have to write all the code for this yourself. SQLite is a much better solution

Only if you want the entire dataset. Datasette allows you to query for a subset of data.

This is really cool! It'd be fun to have a GraphQL variant of this.

Is rewriting in GraphQL the new rewriting in Rust?

I'd also like to see the GraphQL version of this, mainly because of GraphQL's strongly typed schema definitions. Means less work required to validate inputs.

Congrats on shipping!

I have a sqlite DB with historical Visa data, and I want to expose it as an API (like your software does), but with API key. Not a python user here, can you tell me about the following?

1. Is it possible to put some security behind the API? So only queries with a key can access it?

2. How would I host this API? Where would the SQLite DB reside?

3. I understand the API is read-only, which is great and works for my use case! Can I also do read-writes to the database from a separate application that will keep updating the database on a nightly basis?

I've not considered authentication at all. If you do need that, my preferred option would be to put it behind a proxy such as nginx that can be configured to require authentication.

I've had good luck with this library to use OpenID Connect with Nginx: https://github.com/pingidentity/lua-resty-openidc

You would run this on the same VM or other compute as your SQLite database. You should still be able to perform scheduled updates (cron?) to the underlying SQLite database.

The database is opened immutably, so my guess is you’d either make a copy and update or regenerate it from another source (a la Simon’s CSV to sqlite tool. You could have a symlink that points to the current DB

Congrats for the launch! I was searching for such tool exact yesterday:)

I understand the benefits of Immutable pattern, but it'll be nice if datasettle could provide interface with single non-frequent writer(for example, most FRED data is updated seasonally, some of them updates daily) so the pipeline pulling data from some source could be automated

The immutable thing is pretty key to the whole project - if you do need a writable API there are a few options out there already though. https://github.com/coleifer/sqlite-web looks particularly good.

Cool little utility. Like the fact you can pass SQL in the query string. Have to look for a use for this. Maybe internal use apps that update SQLite periodically from a larger database and just need a view of data. Thanks for sharing.

Disappointed this doesn't work on Windows, looks really cool.

The name is brilliant.

This looks great except for the fact that it runs through cloudflare, that's a big negative for me and will restrict a little my ability to deploy this. However, I can still see some great applications of this that I can't wait to try out.

Cloudflare is absolutely not a requirement - https://polar-bears.now.sh/ for example is running directly on now.sh, and you can deploy a datasette anywhere that can run a Docker container (or just use plain Python 3).

Because the data is meant to be immutable, Datasette works extremely well with any caching proxy. Cloudflare just happens to have the most generous free tier, but self-hosted Varnish or Fastly would work just as well as Cloudflare, if not better.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact