
Show HN: Datasette – Create and publish an API for SQLite databases - simonw
https://simonwillison.net/2017/Nov/13/datasette/
======
simonw
A couple of demos:

[https://fivethirtyeight.datasettes.com/](https://fivethirtyeight.datasettes.com/)
contains SQLite versions of every CSV published by the FiveThirtyEight crew to
[https://github.com/fivethirtyeight/data](https://github.com/fivethirtyeight/data)

I built a separate tool on top of pandas which can mass-convert CSVs to SQLite
- [https://github.com/simonw/csvs-to-sqlite](https://github.com/simonw/csvs-to-sqlite) \- so building the database for
[https://fivethirtyeight.datasettes.com/](https://fivethirtyeight.datasettes.com/)
is now a one-liner:

    
    
        csvs-to-sqlite ~/fivethirtyeight/data fivethirtyeight.db
    

[https://parlgov.datasettes.com/](https://parlgov.datasettes.com/) uses the
SQLite database from [http://www.parlgov.org/](http://www.parlgov.org/) \-
this is a more interesting demo as it includes things like SQL views - e.g.
[https://parlgov.datasettes.com/parlgov/view_election](https://parlgov.datasettes.com/parlgov/view_election)

Both of these demos are served through Cloudflare - since the URL to each
database incorporates a sha256 hash of the database contents, I can send a
far-future cache expiry header on everything.
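
A minimal sketch of how a content-addressed URL like those could be derived (a hypothetical helper, not Datasette's actual code; the 7-character prefix matches the demo URLs above):

```python
import hashlib

def db_url_path(db_name: str, db_bytes: bytes, prefix_len: int = 7) -> str:
    # Hash the database file contents and keep a short prefix,
    # similar to how git abbreviates commit hashes.
    digest = hashlib.sha256(db_bytes).hexdigest()
    return f"/{db_name}-{digest[:prefix_len]}"

# Any change to the file contents produces a new URL path,
# so responses for old URLs can be cached forever.
path = db_url_path("fivethirtyeight", b"example database bytes")
```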

~~~
vog
_> since the URL to each database incorporates a sha256 hash of the database
contents_

Small nitpick: It incorporates a SHA-256 prefix, i.e. a greatly reduced
SHA-256 hash, so the cryptographic integrity guarantees no longer hold.

 _> send a far-future cache expiry header on everything_

If you really want to do that, you should include the full SHA-256 hash in the
URL. If you want to shorten the URL, use a more compact encoding than hex
(e.g. Base64 or Ascii85).

Alternatively, make it optional (but I admit this probably defeats the
purpose): Short URLs without hash and short expiration, long URLs with hash
and eternal expiration.

~~~
simonw
I don’t need cryptographic guarantees because I’m not dealing with untrusted
data: if you want to screw up Datasette’s caching model by deliberately
creating two different databases with the same sha256 prefix then go ahead -
you’ll only be hurting your own deployment. As such I figured that if a SHA
prefix is good enough for common usage of git, it’s good enough for me.

------
hamandcheese
See also: PostgREST, which turns a PostgreSQL database into a rest API.

[https://postgrest.com/](https://postgrest.com/)

~~~
everdev
PostgREST looks fantastic. Anyone used it in production?

~~~
hucker
Yes, with good results. Anything specific you're curious about?

~~~
patricius
Hi. I'd like to know how you design the database to separate data available
via REST and data that is not.

~~~
hucker
We use a separate schema for the API (that way we can easily version our APIs
as well). The API schemas only contain views/functions that expose the
tables/fields/etc. that we want to expose.
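
The views-only part of that pattern can be sketched in miniature with Python's sqlite3 module (the table and column names here are hypothetical, and SQLite has no Postgres-style schemas, so a single view stands in for the "API schema contains only views" idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Internal table: contains columns we never want to expose.
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT,
        password_hash TEXT
    );
    INSERT INTO users VALUES (1, 'a@example.com', 'secret');

    -- The API layer only ever queries this view, not the table.
    CREATE VIEW api_users AS SELECT id, email FROM users;
""")

rows = conn.execute("SELECT * FROM api_users").fetchall()
# password_hash never leaves the view
```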

------
stedalus
Click is nice, but for my money, Fire is what Click should have been.
[https://github.com/google/python-fire](https://github.com/google/python-fire)

~~~
simonw
I love how opinionated Click is, especially around how CLI tools should be
documented. I feel like I built a better tool because Click guided me towards
CLI design best practices.

~~~
pletnes
The significant downside to click is the slow startup time. Seems to add more
than a second to startup in most cases! Still love it though.

------
assafmo
Shameless plug: I made something similar a while back -
[https://github.com/assafmo/SQLiteProxy](https://github.com/assafmo/SQLiteProxy)

Not as fancy as OP's, but it's run great in production since last June :-)

------
tomcam
Asynchronous, easily configurable at runtime, and an exemplar of Python code
by one of the creators of Django. Quite a treat.

------
ptype
Nice! Congrats on shipping. Any particular reason it's tied so closely to
SQLite rather than SQLAlchemy, which could potentially enable a range of
backends? I guess to ensure immutability?

~~~
simonw
The "REST API for your database" thing isn't actually the part of this I'm
most excited about - there are quite a few entries in that category already,
such as
[https://github.com/jeffknupp/sandman2](https://github.com/jeffknupp/sandman2)

The thing that excites me is being able to take almost any dataset, package it
up into a simple, well-understood format and then deploy that in a read-only,
highly scalable way, taking advantage of modern containerization systems.

Read-only SQLite is /perfect/ for this use-case. It's fast, widely supported,
has a great feature-set and can be distributed as a single file.
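
Read-only mode is easy to demonstrate with Python's built-in sqlite3 module; a standalone sketch (not Datasette code) using SQLite's mode=ro URI parameter:

```python
import os
import sqlite3
import tempfile

# Create a throwaway database file to demonstrate read-only mode.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE t (x)")
    conn.execute("INSERT INTO t VALUES (1)")

# Open it read-only via a URI: reads still work, writes now fail.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
value = ro.execute("SELECT x FROM t").fetchone()[0]
try:
    ro.execute("INSERT INTO t VALUES (2)")
    writable = True
except sqlite3.OperationalError:
    writable = False
```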

My favourite feature of Datasette isn't the web UI, it's the publish command:

    
    
        datasette publish now ~/path-to-my/sqlite.db
    

Combined with my [https://github.com/simonw/csvs-to-sqlite](https://github.com/simonw/csvs-to-sqlite) tool it means you can take a
CSV file and turn it into a live-on-the-internet REST SQL API in literally
minutes. Here are a bunch of datasets that I've done that to just today:
[https://github.com/simonw/datasette/wiki/Datasettes](https://github.com/simonw/datasette/wiki/Datasettes)

~~~
stdbrouw
I'm absolutely, unequivocally not trying to be snarky, but isn't CSV itself
perfect for this use-case?

~~~
simonw
CSV is a pretty terrible format for storing and transmitting data: it only
supports string types, the escaping rules are poorly understood, and it can
only model a single flat table.
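
The type-loss point is easy to see with Python's standard csv module; round-trip a row through CSV and every value comes back as a string:

```python
import csv
import io

# Write a row containing an int, a float and a null, then read it back.
buf = io.StringIO()
csv.writer(buf).writerow([1, 2.5, None])
buf.seek(0)
row = next(csv.reader(buf))
# integer, float and null have all collapsed into strings
```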

------
o1lab
Shameless plug (for MySQL):
[https://github.com/o1lab/xmysql](https://github.com/o1lab/xmysql)

    
    
      XMYSQL: One command to generate REST APIs for any MySQL
      database.
    
      > 20+ releases to npm in the last 2 weeks.
      > Have added 'a ton' of features since last shown on HN.

------
timbit42
Does Commodore own the trademark for the Datasette name?

~~~
mastax
Expired in 1993:
[http://tsdr.uspto.gov/#caseNumber=72389457&caseType=SERIAL_N...](http://tsdr.uspto.gov/#caseNumber=72389457&caseType=SERIAL_NO&searchType=statusSearch)

~~~
timbit42
This was trademarked by Audio Devices in Stamford, CT, in 1971, which makes
sense since the Commodore Datasette likely didn't exist until the Commodore
PET was created in 1977. Commodore seemed to use both 'datasette' and
'datassette'. Perhaps Commodore registered the double-S spelling. Perhaps they
were challenged by Audio Devices for using the 'datasette' spelling.

------
anonfunction
Any examples of how it handles pagination? Couldn't find it in the README.md

~~~
simonw
Good catch, I haven't documented that yet.

There's a hard limit of 1,000 rows allowed to be returned at any one time.
This is because the overhead of serializing more JSON than that could
potentially lock up the server. If you try to pull more than that back you get
a warning that the data has been truncated, e.g.
[https://fivethirtyeight.datasettes.com/fivethirtyeight-2628d...](https://fivethirtyeight.datasettes.com/fivethirtyeight-2628db9?sql=select+*+from+%5Btwitter-
ratio%2Fsenators%5D) \- in the JSON that shows up as "truncated": true

If that happens to you using raw SQL, it's up to you to implement offset/limit
pagination.

If you are looking at a view, the tool implements offset/limit pagination for
you e.g.
[https://parlgov.datasettes.com/parlgov-25f9855/view_election...](https://parlgov.datasettes.com/parlgov-25f9855/view_election?_next=100)

For iterating through a raw table, a more efficient mechanism is used: cursor-
based pagination. The table is ordered by its primary key (or its rowid if no
primary key was specified) and the "next" links say "get the next 100 items
where the primary key is greater than this value" \- e.g.
[https://parlgov.datasettes.com/parlgov-25f9855/cabinet_party...](https://parlgov.datasettes.com/parlgov-25f9855/cabinet_party?_next=100)

This means even a table with millions of records can be efficiently paginated
through - unlike offset/limit which tends to get slow after a few hundred
thousand records.
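
A minimal sketch of that keyset approach with Python's sqlite3 module (hypothetical table and helper, not Datasette's implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 251)])

def page(conn, after_id=0, size=100):
    # Keyset pagination: seek past the last-seen primary key, so the
    # database never scans and discards rows the way OFFSET does.
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, size),
    ).fetchall()

first = page(conn)                           # rows 1..100
second = page(conn, after_id=first[-1][0])   # rows 101..200
```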

------
dangoor
This is really cool! It'd be fun to have a GraphQL variant of this.

~~~
slaymaker1907
Is rewriting in GraphQL the new rewriting in Rust?

~~~
ZenoArrow
I'd also like to see the GraphQL version of this, mainly because of GraphQL's
strongly typed schema definitions. Means less work required to validate
inputs.

------
wsxiaoys
Congrats on the launch! I was searching for exactly such a tool yesterday :)

I understand the benefits of the immutable pattern, but it would be nice if
Datasette could provide an interface with a single infrequent writer (for
example, most FRED data is updated seasonally, though some of it updates
daily) so a pipeline pulling data from some source could be automated.

~~~
simonw
The immutable thing is pretty key to the whole project - if you do need a
writable API there are a few options out there already though.
[https://github.com/coleifer/sqlite-web](https://github.com/coleifer/sqlite-web) looks particularly good.

------
mythrwy
Cool little utility. I like the fact that you can pass SQL in the query
string. I'll have to look for a use for this. Maybe internal-use apps that
update SQLite periodically from a larger database and just need a view of the
data. Thanks for sharing.

------
ccarse
Disappointed this doesn't work on Windows, looks really cool.

------
dgudkov
The name is brilliant.

------
justboxing
Congrats on shipping!

I have a SQLite DB with historical Visa data, and I want to expose it as an
API (like your software does), but with an API key. Not a Python user here, so
can you tell me about the following?

1\. Is it possible to put some security behind the API? So only queries with a
key can access it?

2\. How would I host this API? Where would the SQLite DB reside?

3\. I understand the API is read-only, which is great and works for my use
case! Can I also do read-writes to the database from a separate application
that will keep updating the database on a nightly basis?

~~~
simonw
I've not considered authentication at all. If you do need that, my preferred
option would be to put it behind a proxy such as nginx that can be configured
to require authentication.
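
One way to sketch that setup, as a hypothetical nginx fragment using HTTP basic auth in front of a locally running Datasette (8001 is assumed as the port, and the htpasswd path is a placeholder):

```nginx
server {
    listen 80;
    server_name data.example.com;

    location / {
        # Require a username/password before anything reaches Datasette.
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Forward authenticated requests to the Datasette process.
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
    }
}
```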

~~~
malingo
I've had good luck with this library to use OpenID Connect with Nginx:
[https://github.com/pingidentity/lua-resty-openidc](https://github.com/pingidentity/lua-resty-openidc)

------
rmbeard
This looks great except for the fact that it runs through Cloudflare; that's a
big negative for me and will restrict my ability to deploy this a little.
However, I can still see some great applications of this that I can't wait to
try out.

~~~
simonw
Cloudflare is absolutely not a requirement - [https://polar-bears.now.sh/](https://polar-bears.now.sh/) for example is running directly on
now.sh, and you can deploy a datasette anywhere that can run a Docker
container (or just use plain Python 3).

Because the data is meant to be immutable, Datasette works extremely well with
any caching proxy. Cloudflare just happens to have the most generous free
tier, but self-hosted Varnish or Fastly would work just as well as Cloudflare,
if not better.

