Show HN: Datasette – Create and publish an API for SQLite databases

simonw · on Nov 14, 2017

A couple of demos:

https://fivethirtyeight.datasettes.com/ contains SQLite versions of every CSV published by the FiveThirtyEight crew to https://github.com/fivethirtyeight/data

I built a separate tool on top of pandas which can mass-convert CSVs to SQLite - https://github.com/simonw/csvs-to-sqlite - so building the database for https://fivethirtyeight.datasettes.com/ is now a one-liner:

    csvs-to-sqlite ~/fivethirtyeight/data fivethirtyeight.db

https://parlgov.datasettes.com/ uses the SQLite database from http://www.parlgov.org/ - this is a more interesting demo as it includes things like SQL views - e.g. https://parlgov.datasettes.com/parlgov/view_election

Both of these demos are served through Cloudflare - since the URL to each database incorporates a sha256 hash of the database contents, I can send a far-future cache expiry header on everything.

vog · on Nov 14, 2017

> since the URL to each database incorporates a sha256 hash of the database contents

Small nitpick: It incorporates a SHA-256 prefix, i.e. a greatly reduced SHA-256 hash, so the cryptographic integrity guarantees no longer hold.

> send a far-future cache expiry header on everything

If you really want to do that, you should include the full SHA-256 hash in the URL. If you want to shorten the URL, use a more compact encoding (e.g. Base64 or Ascii85, hex).

Alternatively, make it optional (but I admit this probably defeats the purpose): Short URLs without hash and short expiration, long URLs with hash and eternal expiration.

simonw · on Nov 14, 2017

I don’t need cryptographic guarantees because I’m not dealing with untrusted data: if you want to screw up Datasette’s caching model by deliberately creating two different databases with the same sha256 prefix then go ahead - you’ll only be hurting your own deployment. As such I figured that if a SHA prefix is good enough for common usage of git, it’s good enough for me.

hamandcheese · on Nov 14, 2017

See also: PostgREST, which turns a PostgreSQL database into a rest API.

https://postgrest.com/

awb · on Nov 14, 2017

PostgREST looks fantastic. Anyone used it in production?

michalskop · on Nov 14, 2017

I use postgrest for several projects. I have an issue with crashing occasionally (once in months) for no obvious reason (to me). I solved it by continously checking the status and automatically restarting it if it's down. Otherwise great.

hucker · on Nov 14, 2017

Yes, with good results. Anything specific you're curious about?

patricius · on Nov 14, 2017

Hi. I'd like to know how you design the database to separate data available via REST and data that is not.

hucker · on Nov 14, 2017

We use a separate schema for the API (that way we can easily version our APIs as well). The API schemas only contain views/function that expose the tables/fields/etc that we want to expose.

evv555 · on Nov 14, 2017

SQL View queries get their own REST endpoints or you can use Postgrest like a language agnostic ORM.

stedalus · on Nov 14, 2017

Click is nice, but for my money, Fire is what Click should have been. https://github.com/google/python-fire

simonw · on Nov 14, 2017

I love how optinionated Click is, especially around how CLI tools should be documented. I feel like I built a better tool because Click guided me towards best CLI design practices.

pletnes · on Nov 14, 2017

The significant downside to click is the slow startup time. Seems to add more than a second to startup in most cases! Still love it though.

tomcam · on Nov 14, 2017

Asynchronous, easily configurable at runtime, and an exemplar of Python code by one of the creators of Django. Quite a treat.

assafmo · on Nov 14, 2017

Shameless plug: I made something similar a while back - https://github.com/assafmo/SQLiteProxy

Not as fancy as OP's, but runs great in production since last June :-)

ptype · on Nov 14, 2017

Nice! Congrats on shipping. Any particular reason it's tied closely together with sqlite rather sqlalchemy which could potentially enable a range of backends? I guess to ensure immutability?

simonw · on Nov 14, 2017

The "REST API for your database" thing isn't actually the part of this I'm most excited about - there are quite a few entries in that category already, such as https://github.com/jeffknupp/sandman2

The thing that excites me is being able to take almost any dataset, package it up into a simple, well-understood format and then deploy that in a read-only, highly scalable way, taking advantage of modern containerization systems.

Read-only SQLite is /perfect/ for this use-case. It's fast, widely supported, has a great feature-set and can be distributed as a single file.

My favourite feature of Datasette isn't the web UI, it's the publish command:

    datasette publish now ~/path-to-my/sqlite.db

Combined with my https://github.com/simonw/csvs-to-sqlite tool it means you can take a CSV file and turn it into a live-on-the-internet REST SQL API in literally minutes. Here are a bunch of datasets that I've done that to just today: https://github.com/simonw/datasette/wiki/Datasettes

stdbrouw · on Nov 14, 2017

I'm absolutely, unequivocally not trying to be snarky, but isn't CSV itself perfect for this use-case?

simonw · on Nov 14, 2017

CSV is a pretty terrible format for storing and transmitting data: it only supports string types, the escaping rules are poorly understood, and it can only model a single flat table.

_wmd · on Nov 14, 2017

CSV files have a single order and no support for indexes. Even if you sort your CSV on one or more interesting columns and use binary search to locate an interesting range, you can still only do this along a single dimension, any remaining queries must scan the entire file. You'd also have to write all the code for this yourself. SQLite is a much better solution

lukechilds · on Nov 14, 2017

Only if you want the entire dataset. Datasette allows you to query for a subset of data.

o1lab · on Nov 14, 2017

shameless plug (for MySql) : https://github.com/o1lab/xmysql

  XMYSQL : One command to generate REST APIs for any MySql 
  Database.

  > 20+ releases to npm in last 2 weeks. 
  > Have added 'a ton' of features since last shown on HN.

timbit42 · on Nov 14, 2017

Does Commodore own the trademark for the Datasette name?

mastax · on Nov 14, 2017

Expired in 1993: http://tsdr.uspto.gov/#caseNumber=72389457&caseType=SERIAL_N...

timbit42 · on Nov 14, 2017

This was trademarked by Audio Devices in Stamford, CT, in 1971, which makes sense since the Commodore Datasette likely didn't exist until the Commodore PET was created in 1977. Commodore seemed to use both 'datasette' and 'datassette'. Perhaps Commodore registered the double-S spelling. Perhaps they were challenged by Audio Devices for using the 'datasette' spelling.

bazzert · on Nov 14, 2017

well that brought on a surge of nostalgia.

ifdefdebug · on Nov 14, 2017

listening to the sound of your programs and data at a few hundred (dozens?) bytes per second :)

anonfunction · on Nov 14, 2017

Any examples of how it handles pagination? Couldn't find it in the README.md

simonw · on Nov 14, 2017

Good catch, I haven't documented that yet.

There's a hard limit of 1,000 rows allowed to be returned at any one time. This is because the overhead of serializing more JSON that than could potentially lock up the server. If you try to pull more than that back you get a warning that the data has been truncated, e.g. https://fivethirtyeight.datasettes.com/fivethirtyeight-2628d... - in the JSON that shows up as "truncated": true

If that happens to you using raw SQL, it's up to you to implement offset/limit pagination.

If you are looking at a view, the tool implements offset/limit pagination for you e.g. https://parlgov.datasettes.com/parlgov-25f9855/view_election...

For iterating through a raw table, a more efficient mechanism is used: cursor-based pagination. The table is ordered by its primary key (or its rowid if no primary key was specified) and the "next" links say "get the next 100 items where the primary key is greater than this value" - e.g. https://parlgov.datasettes.com/parlgov-25f9855/cabinet_party...

This means even a table with millions of records can be efficiently paginated through - unlike offset/limit which tends to get slow after a few hundred thousand records.

dangoor · on Nov 14, 2017

This is really cool! It'd be fun to have a GraphQL variant of this.

slaymaker1907 · on Nov 14, 2017

Is rewriting in GraphQL the new rewriting in Rust?

ZenoArrow · on Nov 14, 2017

I'd also like to see the GraphQL version of this, mainly because of GraphQL's strongly typed schema definitions. Means less work required to validate inputs.

wsxiaoys · on Nov 14, 2017

Congrats for the launch! I was searching for such tool exact yesterday:)

I understand the benefits of Immutable pattern, but it'll be nice if datasettle could provide interface with single non-frequent writer(for example, most FRED data is updated seasonally, some of them updates daily) so the pipeline pulling data from some source could be automated

simonw · on Nov 14, 2017

The immutable thing is pretty key to the whole project - if you do need a writable API there are a few options out there already though. https://github.com/coleifer/sqlite-web looks particularly good.

mythrwy · on Nov 14, 2017

Cool little utility. Like the fact you can pass SQL in the query string. Have to look for a use for this. Maybe internal use apps that update SQLite periodically from a larger database and just need a view of data. Thanks for sharing.

ccarse · on Nov 14, 2017

Disappointed this doesn't work on Windows, looks really cool.

dgudkov · on Nov 14, 2017

The name is brilliant.

justboxing · on Nov 14, 2017

Congrats on shipping!

I have a sqlite DB with historical Visa data, and I want to expose it as an API (like your software does), but with API key. Not a python user here, can you tell me about the following?

1. Is it possible to put some security behind the API? So only queries with a key can access it?

2. How would I host this API? Where would the SQLite DB reside?

3. I understand the API is read-only, which is great and works for my use case! Can I also do read-writes to the database from a separate application that will keep updating the database on a nightly basis?

simonw · on Nov 14, 2017

I've not considered authentication at all. If you do need that, my preferred option would be to put it behind a proxy such as nginx that can be configured to require authentication.

malingo · on Nov 14, 2017

I've had good luck with this library to use OpenID Connect with Nginx: https://github.com/pingidentity/lua-resty-openidc

toomuchtodo · on Nov 14, 2017

You would run this on the same VM or other compute as your SQLite database. You should still be able to perform scheduled updates (cron?) to the underlying SQLite database.

technomalogical · on Nov 15, 2017

The database is opened immutably, so my guess is you’d either make a copy and update or regenerate it from another source (a la Simon’s CSV to sqlite tool. You could have a symlink that points to the current DB

rmbeard · on Nov 14, 2017

This looks great except for the fact that it runs through cloudflare, that's a big negative for me and will restrict a little my ability to deploy this. However, I can still see some great applications of this that I can't wait to try out.

simonw · on Nov 14, 2017

Cloudflare is absolutely not a requirement - https://polar-bears.now.sh/ for example is running directly on now.sh, and you can deploy a datasette anywhere that can run a Docker container (or just use plain Python 3).

Because the data is meant to be immutable, Datasette works extremely well with any caching proxy. Cloudflare just happens to have the most generous free tier, but self-hosted Varnish or Fastly would work just as well as Cloudflare, if not better.