Hacker News
Show HN: Datasette Publish – Turn CSVs into a SQLite-Backed JSON API (datasettes.com)
101 points by simonw on Jan 17, 2018 | 21 comments

Way more information plus an animated screenshot showing how to use it on my blog: https://simonwillison.net/2018/Jan/17/datasette-publish/

This is built using the brand new Zeit Now API, which makes it easy to write software that in turn deploys other software: https://zeit.co/blog/api-2

Xmysql: one command to generate an API for any MySQL database. Generates the API in the blink of an eye (literally).

Has some really nice APIs for group by and join, and an intuitive way to perform 'where', 'order by' and other MySQL clauses.

(shameless plug)

Ryan Pitts on Twitter used this to build an API for UFO sightings in Washington state and the North West: https://twitter.com/ryanpitts/status/953728651945627648

Here's his datasette: https://datasette-cmwqbwgnqe.now.sh/csv-data-5284def/nuforc_...

And here's an example SQL query showing the cities with the most UFO sightings: https://datasette-roborbllfq.now.sh/csv-data-b1429ef?sql=sel...
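The kind of aggregate query behind that link can be sketched with Python's built-in sqlite3 module. The table and column names below (`sightings`, `city`) are illustrative stand-ins, not the actual schema of the UFO dataset:

```python
import sqlite3

# Build a tiny in-memory database standing in for the real CSV-derived table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (city TEXT, shape TEXT)")
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?)",
    [("Seattle", "light"), ("Seattle", "disk"), ("Spokane", "orb")],
)

# Cities ranked by number of sightings - the same GROUP BY pattern
# you can paste into a Datasette SQL query box
rows = conn.execute(
    "SELECT city, COUNT(*) AS n FROM sightings "
    "GROUP BY city ORDER BY n DESC"
).fetchall()
print(rows)  # [('Seattle', 2), ('Spokane', 1)]
```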

Related option for those who are less focused on the JSON API and instead want to make their datasets public and queryable: data.world added SQL support a little while back, alongside its original SPARQL support. One big advantage there is that folks can write queries that join your datasets to other open datasets on data.world.

This can be a good solution for JS charting libraries that pull CSV/JSON data from an external data source. Update the chart when the source data changes and you get cheap, easy dynamic web charts.

Original discussion around Datasette's launch back in November 2017:


This is amazing. Great idea, I will try it. I've also been using Zeit Now and love it too.

I've been seeing a lot of these "convert X into a JSON API with query support" tools... but what are the use cases for this type of tool? Who needs this type of solution?

Datasette is specifically designed for publishing static data - it doesn't handle UPDATE/INSERT traffic at all.

This is a good fit for any time you want to share some static data with the world - instead of posting a CSV file, you can share your data as a Datasette instance with a browsable web UI and a JSON API.
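Consuming that JSON API is straightforward: append `.json` to a table URL and zip the columns with each row. The payload below is a hand-written approximation of the response shape, not output captured from a real instance:

```python
import json

# Approximate shape of a Datasette table's .json endpoint response
payload = json.loads("""
{
  "columns": ["city", "shape"],
  "rows": [["Seattle", "light"], ["Spokane", "orb"]]
}
""")

# Pair each row with the column names to get a list of dicts
records = [dict(zip(payload["columns"], row)) for row in payload["rows"]]
print(records[0])  # {'city': 'Seattle', 'shape': 'light'}
```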

The three use-cases I'm personally most interested in for this are:

Data journalism: newspapers collect data about e.g. police shootings https://vice-police-shootings.now.sh/ or actions under the antiquities act https://fivethirtyeight.datasettes.com/fivethirtyeight-2628d... - I want to make it as easy as possible to publish the data beneath the stories.

Government open data: cities like San Francisco publish huge amounts of interesting data. I used San Francisco's 190,000-line CSV of trees in the city to build https://sf-tree-search.now.sh/

Cultural institutions: museums, libraries and art galleries have the collection and sharing of data as part of their mandates. I want to make it as easy as possible to build things like the London Science Museum's collection API: https://group.sciencemuseum.org.uk/about-us/collection/using...

Why not convert the data from CSV to JSON and serve the JSON blob from your object store or CDN (if this solution is immutable)? No need for the API interface; just gzip the JSON blob and have the client retrieve it entirely.

No need for an app or app server, and the JSON can be cached with the rest of your content. Unless I'm missing something?

Size. The San Francisco trees database is around 50MB, and I've successfully tested it with multi-gigabyte SQLite databases.

This was a key answer to a good question. I think you should emphasize that part (handling large datasets) as a key aspect of Datasette.

If you're serving the entire JSON dataset, sure, but this solution allows you to do server-side filtering.

I'm just finishing up a tool that lets you use Google Sheets as a data API resource that returns JSON. In my particular case I have many clients who want the ease of use of WordPress but don't have the resources to keep it properly maintained in terms of security updates. Many of these clients do not actually need to update most of their site on any sort of frequent basis. The case that prompted my new project is a restaurant who updates their daily specials every day, their seasonal specials monthly, and other menu items maybe 2-3 times a year, in addition to semi-frequent homepage background image changes and an occasional flash message. Instead of giving them WordPress, this is what I did:

The site is built in Jekyll so it is 100% static. The dynamic portions of the site, like the food menu, make use of Jekyll's data files feature. The site owner makes changes to their menu in a Google Sheet (which is where they typically store this anyway) and then chooses a deploy target and submits a custom Google Form. The form submit pings a listening Go server on the deploy target, which then kicks off a build script. The build script requests the Sheet data one sheet at a time and stores the results in JSON files which Jekyll can use. A couple of additional steps to get images into place, and then the site is compiled by Jekyll and linked into the nginx webroot. The script even writes back to a log sheet to let the user know how the build went. With selectable deploy targets the site owner can preview her changes on a staging site first before deploying to prod.
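The CSV-to-data-file step of a pipeline like that can be sketched in a few lines. This is a minimal illustration, not the commenter's actual build script, and it leaves out fetching the CSV from the Sheets export URL:

```python
import csv
import io
import json
import pathlib

def sheet_csv_to_data_file(csv_text: str, out_path: str) -> list:
    """Convert one sheet's CSV export into a JSON file that a static
    site generator like Jekyll can read from its _data directory."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    pathlib.Path(out_path).write_text(json.dumps(rows, indent=2))
    return rows

# Illustrative menu data standing in for the real spreadsheet export
menu = sheet_csv_to_data_file(
    "item,price\nSoup of the day,6.50\nSeasonal salad,9.00",
    "menu.json",  # Jekyll would read this as _data/menu.json
)
print(menu[0]["item"])  # Soup of the day
```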

And just like that, the user has a site with all the speed and security of a static site and all the flexibility of a dynamic site.

Is this a managed Hadoop/Hive cluster with some sort of field heuristics baked in? Looks cool and very useful for a certain set of users.

Nope, it's much simpler than that. Each CSV upload is converted into a SQLite database. Then I deploy a static, read-only copy of that database to a Zeit Now hosting account and fire up my "datasette" command-line tool (written in Python 3) which opens the SQLite database and starts serving up queries from it.
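The CSV-to-SQLite conversion step can be sketched with the stdlib alone. The real pipeline does more (type detection, multiple files); this just shows the bare idea, with made-up tree data:

```python
import csv
import io
import sqlite3

# Illustrative CSV standing in for an uploaded file
csv_text = (
    "name,species\n"
    "Ginkgo,Ginkgo biloba\n"
    "Coast Live Oak,Quercus agrifolia"
)
rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

# Create one table per CSV, with a TEXT column per CSV column
conn = sqlite3.connect(":memory:")
cols = ", ".join(f'"{c}" TEXT' for c in header)
conn.execute(f'CREATE TABLE "csv_data" ({cols})')
conn.executemany(
    'INSERT INTO "csv_data" VALUES (%s)' % ",".join("?" * len(header)),
    data,
)
count = conn.execute('SELECT COUNT(*) FROM "csv_data"').fetchone()[0]
print(count)  # 2
```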


The magic here is that the SQLite database is opened in read-only mode, which means I don't have to worry about concurrent access or write activity. SQLite is screamingly fast if you use it in this context.
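Python's sqlite3 module exposes that read-only mode through SQLite's URI filenames. A small demonstration (the database file here is a throwaway created just for the example):

```python
import os
import sqlite3
import tempfile

# Create a throwaway database file to reopen read-only
path = os.path.join(tempfile.mkdtemp(), "demo.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE t (x)")
rw.execute("INSERT INTO t VALUES (1)")
rw.commit()
rw.close()

# mode=ro opens the file read-only: reads work, writes raise an error,
# so any number of processes can serve it without locking concerns
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT x FROM t").fetchone())  # (1,)
try:
    ro.execute("INSERT INTO t VALUES (2)")
except sqlite3.OperationalError as e:
    print("write rejected:", e)
```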

Because the data is read-only, if you ever DO need to scale to handle more traffic you can do so by firing up additional instances, each with their own copy of the SQLite database file.

This is awesome. Could this be extended so that new CSV uploads can be added at a later date and deduplicated?

Zeit Now deployments are immutable, but I can have an "edit" feature which retrieves the data used for a deployment, lets you modify it (add new CSV files for example) and then deploys a brand new instance with the new changes. Zeit supports URL aliases (which I need to expose in Datasette Publish) so you can point an alias at the fresh deployment.

Stuff like CSV de-duplication is out of scope for Datasette Publish - it's something you would need to do with other tools (e.g. Excel) before uploading the data.


Now the bottleneck to publishing data is getting the API and DB hosted somewhere, because that part is still not trivial for many.

That's exactly what Datasette Publish solves: it's a tool which you upload CSVs to and it deploys the application on the internet for you, for free.

Under the hood it's taking advantage of the incredibly generous free hosting tier offered by Zeit: https://zeit.co/pricing

As a user though, you don't need to worry about this at all. Sign in, verify your email address (which signs you up for an account with Zeit), upload your CSVs and Datasette Publish will deploy the application with your data and give you back a URL hosted on the public internet.
