
SQL as an API language - craigkerstiens
https://simonwillison.net/2018/Oct/4/datasette-ideas/#SQL_as_an_API_language
======
forkandwait
Slightly on topic: when giving unsolicited tech career advice, my favorite
trope is "learn sql". Don't have any computer background? Start with SQL.
Already write compilers for a living but don't know SQL? Of course you should
learn SQL. A web dev who is tied to ORMs? Learn SQL. A manager who wants to
understand the IT that runs your department? Yep, SQL. Etc, etc. SQL is
probably the most ubiquitous language under the hood today.

~~~
adito
> Already write compilers for a living ...

Please pardon my ignorance, I'm really interested in this. What kind of jobs
have you writing compilers for a living? The ones I know of are along the
lines of Mozilla's, Google's, or Apple's devs working on the Rust/Go/Swift
languages. Are there other kinds of jobs in this
writing-compilers-for-a-living category?

~~~
dspillett
Larger chip manufacturers (those with the resources to be designing their own
CPUs/GPUs/similar) will likely have a dedicated team maintaining reference
compilers for their designs, especially the more experimental or secret
designs that are not yet proven and/or publicly available.

Those creating specialist processing chips more dynamically in FPGAs may also
have someone working in that area if they intend to control the result via a
higher level language.

------
steve-chavez
You could get a more complete solution with PostgreSQL + PostgREST. Regarding
the concerns:

> Security: we don’t want people messing up our data

Row level security + roles/grants should be enough to cover this one.
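
A minimal sketch of what that could look like (table, role and tenant-setting
names are invented):

    -- a login-less role for the API; each tenant only sees its own rows
    create role api_user nologin;
    grant select on messages to api_user;

    alter table messages enable row level security;

    create policy messages_tenant on messages for select to api_user
      using (tenant_id = current_setting('app.tenant_id')::int);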

> On performance: SQLite has a mechanism for canceling queries that take
> longer than a certain threshold.

With a timeout alone, all it takes to DoS the service is requesting the
expensive queries in parallel. It would be better to filter by the query plan;
in PostgreSQL there's an extension for this:
[https://github.com/pgexperts/pg_plan_filter](https://github.com/pgexperts/pg_plan_filter).
Though pg also has a `statement_timeout`.
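
The timeout can even be scoped to the API's role, e.g. (role name assumed):

    alter role api_user set statement_timeout = '2s';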

> Hiding implementation details

You can expose a dedicated schema for your API that doesn't contain a single
table, only views and stored functions. Migrating this schema should be
easy thanks to transactional DDL.
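
Sketched out (schema and object names invented for illustration):

    -- clients only ever see the "api" schema, never the real tables
    create schema api;
    create view api.users as
      select id, name from private.users where deleted_at is null;

    grant usage on schema api to api_user;
    grant select on api.users to api_user;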

That being said, I agree with the sentiment about GraphQL: besides the extra
query language, there's an additional impedance mismatch, since now you have
to map your DB schema to the GraphQL schema.

~~~
copperx
I'm always surprised that PostgREST isn't the industry standard for creating
an API. It looks like it would save so much work and the corner cases can be
handled without writing code in a different language.

What am I missing?

~~~
ufmace
I tried it, and still have it running on a little toy service I built. It's
cool, but IMO, it doesn't really seem production-ready for a large-scale
service. Last time I updated it, I had to spend a full day sorting out the
changes they made to how the server config and user authentication works. Last
I checked, they don't even have a suggested script to run it as a service.
Going by the contributor graph, it's still essentially a 1-person project.

~~~
mmt
IIRC, this was what the connection poolers (pgpool, pgbouncer), and perhaps
all now-popular postgres extensions/enhancements, looked like in their initial
years, as well.

In the context of a "large-scale service", my guess is that there's a
chicken-and-egg problem, where an operator of such a service has to decide
they want it to be production-ready enough to throw money and/or engineering
resources at the project, which can then attract yet more contributors.

~~~
profalseidol
> a chicken-and-egg problem

The way I see it, large corporations have the hen that lays the golden eggs,
and plebeian projects like PostgREST have little to no chance.

------
skrebbel
Note that this is solely in the context of a static data set. That makes this
a relatively limited idea, since most datasets are either dynamic or small
enough to just copy over. I write this not to dismiss Datasette, which looks
like an _excellently_ designed tool for the problem it's meant to tackle, but
mostly as a response to the title of this particular HN submission, "SQL as an
API language" which suggests maybe more than "just" querying static data.

Thing is, I really want to use/expose SQL as an API language for dynamic data.

E.g. the author writes:

> _On hidden implementation details: since we are publishing static data
> rather than maintaining an evolving API, we can mostly ignore this issue. If
> you are really worried about it you can take advantage of canned queries and
> SQL view definitions to expose a carefully selected forward-compatible view
> into your data._

At my company ([https://talkjs.com](https://talkjs.com)) we explored making
our public API SQL-based for the chat & message data that our customers host
with us. It would solve lots of customer requests in one go, much more so than
adding a new REST resource for every kind of question a customer might want to
ask our API. What's been holding us back is exactly this: even if we
carefully designed a set of database views to guarantee backward
compatibility, doing so would _severely_ limit the kinds of database
refactorings we'd be able to do, especially when it comes to adding
redundancy, denormalizing stuff, etc.
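
To make the trade-off concrete, the kind of contract we'd be committing to
looks roughly like this (a sketch, with hypothetical names):

    -- the published, frozen contract
    create view api.messages_v1 as
      select m.id, m.body, c.subject
      from messages m
      join conversations c on c.id = m.conversation_id;

    -- after denormalizing subject onto messages, the view body can change,
    -- but its columns and types can never change again
    create or replace view api.messages_v1 as
      select id, body, subject from messages;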

If anyone else has experience with exposing queryable (SQL or otherwise)
dynamic APIs directly to customers, I'd love to hear your insights. How did
you tackle this problem?

~~~
scrollaway
What you're describing is exactly what GraphQL sets out to solve. Most of the
flexibility and declarativeness of SQL, but still sitting a layer above the
raw data itself.

~~~
aikah
> Most of the flexibility and declarativeness of SQL,

Does GraphQL support aggregate functions like count, sum, average and co?

~~~
scrollaway
No, that's why I said most, and that's also what makes it more appropriate
for APIs -- you don't want uncontrolled query costs hitting you like this. If
you wish to allow aggregates, you expose them as properties.

~~~
derefr
> you don't want uncontrolled performance hitting you like this

On RDBMSes with multitenancy support, you can simply apply resource
constraints and quotas (in CPU time, CPU percentage, wall-clock time, memory,
disk usage, etc.) to each user and to each connection. (You just have to
ensure that each separate customer ends up hitting the DBMS as a separate DB
user.)
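
In stock PostgreSQL, for instance, a few of these knobs already exist per
role; a sketch, with an invented role name:

    alter role tenant_a set work_mem = '16MB';        -- memory per sort/hash
    alter role tenant_a set temp_file_limit = '1GB';  -- scratch disk usage
    alter role tenant_a connection limit 4;           -- concurrency

More elaborate CPU quotas need a DBMS with a real resource governor.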

The entire point of the OLAP model is to allow clients to ask their _own_
questions. Some of those questions may be too expensive to answer—modern
DBMSes are smart enough to realize which queries those are, and allow the
sysadmin to restrict users from doing them. But that restriction should always
be a matter of _scale_, not of _kind_. Users should be able to do a very
complex query over a small data-set, or a very simple query over a huge data-
set, or a combination, as long as they aren't actually impacting the
performance of the cluster for other users.

~~~
scrollaway
Sure. And if that's a model that is appropriate for your API users, go for it.

But usually, it won't be appropriate. And by usually, I mean it will almost
never be appropriate to adopt a model like this for your users. And I say this
having been in the very situation where it _was_ appropriate and we did expose
it.

~~~
derefr
Alternately, what's wrong with just tracking the expense of the user's query
and then passing the costs of it back to them? Like most "scale-free" DBaaSes
(Dynamo, BigQuery, etc.) do.

This is assuming your users are _paying_ for your API, mind you. I have no
idea why you'd _want_ to provide an online relational data-access service for
free. (The "free" option for OLAP is, and has always been, "we give you a .zip
of CSV files, and you ETL them into your own DB to look at them." That's how
things like government GIS data or Amazon's and Google's hosted "public
datasets" work—you pay for online access, or you get free _offline_ snapshot
dumps.)

~~~
scrollaway
You are the only one who has been mentioning OLAP so far. It's just one use
case among many.

~~~
derefr
OLAP is the only use-case where queries that require complex aggregate
computations with potentially-unbounded runtimes are relevant. When you
subtract GraphQL from SQL, the difference is "arbitrary OLAP reporting
capabilities." So I think it's fair to focus on this use-case when talking
about what exactly having an SQL API gets you (over having a GraphQL API.)

~~~
scrollaway
I'm not really sure what you're trying to communicate here. Arbitrary
aggregates are not usually an appropriate thing to expose to users as part of
an API. Where they're actually relevant doesn't change that.

~~~
derefr
Maybe you don't understand the original _point_ of SQL as a language? It's a
language where humans—usually, employees who are granted access to data, or
customers who are paying for access to data—type up descriptions of reports
they want to receive, in a formal language; send it off to a remote system;
and then those reports, magically, come back. In between, there can be
anything—maybe a bunch of interns manually collating data from thousands of
Excel spreadsheets. But usually there's a DBMS, an automated system whose job
is to do _whatever is necessary_ to turn the data it has, into the reports
people need—no matter how complex a task that turns out to be. Because it's a
complex task that _needs doing_.

I don't think it matters whether you call that "an API"; I think your implicit
point here is that "an API" is something that you expose to the general
public, and this is ...not that. But if we're talking about _the benefit
people get from having SQL-shaped access to your data model_, then this is
what it is: the ability to specify complete, arbitrarily-complex reports that
they want your system to calculate out for them.

To me, the entire point of "an SQL API", as opposed to any other kind of API,
is that it allows the data in your system to serve the needs of people who
need those reports. Other users can be satisfied with pre-defined reports. The
point of SQL is to serve the needs of users with _arbitrary moment-to-moment
needs_. Users who need to _explore_ and _decompose_ your data in ways you
never considered. Users who need to _experiment_ inside your data model; to
_investigate_ it, to _audit_ it.

With any other kind of API, such people would be too constrained by your pre-
built possible queries to satisfy their needs; and so would instead need a
whole ETL pipeline built for them, to take your raw data _out_ of your system,
and put it _into_ their own (OLAP) system, where they could then ask _that_
system to generate their reports.

If you expose an SQL API, you obviate all of that—the customers or employees
who need to generate complex reports from your data can just _ask your system_
to go do that.

---

Consider a government agent tasked with auditing Facebook's GDPR compliance.
How would they _be enabled to_ do that?

Facebook certainly can't just take _an offline dump of the entirety of
Facebook's data_ and give it to the auditor, in order for them to play with
it in their own OLAP systems. It won't fit, and that's a GDPR violation
in-and-of-itself besides.

Alternately, dumping a random sampling of their data might miss important low-
frequency edge-cases.

That leaves online examination of their data.

It's not already in relational form—most of it is in Cassandra, a NoSQL
database. So Facebook can't just give them access "to their RDBMS"—they don't
have one.

But none of these other systems _directly_ support the kinds of arbitrarily-
complex queries the auditor needs to do.

What would Facebook need to build, for this use-case?

Why, an SQL API gateway server, of course—one that wraps their entire data
warehouse, allowing the auditor to issue arbitrary SQL queries that reach out
to join together and process data from all of Facebook's services.

 _That's_ when you have "an SQL API."

---

And there are more use-cases than just "auditors."

If, for example, the reason you're _collecting_ data is to do analytics on it
(this is e.g. advertising data), then the point of _your business_ is selling
your customers the ability to arbitrarily analyze that data. You might have an
online GUI for doing basic BI-style break-downs, but the only "fallback"
interface that your customers will be truly satisfied with is an SQL API.

~~~
scrollaway
This is a lot of text to essentially say what I've been repeating: If it's the
right tool, use it. I never said it's _never_ the right tool. What you are
describing isn't the usual scenario for API consumers.

------
bpchaps
A really neat project (written in haskell) for turning a postgres database
into a restful API:

[https://github.com/PostgREST/postgrest](https://github.com/PostgREST/postgrest)

~~~
Noumenon72
I wish it gave a little "hello world" example of what it lets you _do_.

~~~
bpchaps
From that repo:
[http://postgrest.org/en/stable/api.html](http://postgrest.org/en/stable/api.html)

~~~
Noumenon72
That's just what I wanted. What kind of search skills did you use to find it?
Is there a convention?

------
batbomb
We do this heavily in astronomy:

[http://ivoa.net/documents/TAP](http://ivoa.net/documents/TAP)

Although it's a bit crusty, it is heavily used. We store additional metadata
about columns and we annotate the results we return.

You can easily build something like this in Java with the Presto parser,
including adding authorization to systems that can't or don't support it.

If Go is more of your thing, there's a parser in Vitess:

[https://github.com/vitessio/vitess/blob/master/go/vt/sqlpars...](https://github.com/vitessio/vitess/blob/master/go/vt/sqlparser/sql.y)

I had done some work on porting the presto parser to python
(github.com/slaclab/lacquer) to also build an API, but I had some problems in
my parser grammar I never figured out. I was going to revisit this again now
that Antlr4 can deal with Python properly and I can reuse the presto grammar,
but I haven't had the chance.

You can couple all of this with CSVW as well, and get some additional nice
features out, plus JSON-LD.

------
mindcrime
I'd rather see "SPARQL[1][2] as an API language" myself. SPARQL is criminally
under-appreciated.

BTW, if anybody wants to play around with SPARQL and see some of what it can
do, the Wikidata public SPARQL endpoint[3] is very nice and comes preloaded
with some great examples[4].

[1]:
[https://en.wikipedia.org/wiki/SPARQL](https://en.wikipedia.org/wiki/SPARQL)

[2]:
[https://www.w3.org/TR/sparql11-overview/](https://www.w3.org/TR/sparql11-overview/)

[3]: [https://query.wikidata.org/](https://query.wikidata.org/)

[4]:
[https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)

~~~
owyn
This may just be down to wikidata's implementation, but the reliance on
hardcoded unique IDs everywhere is... bleh

The following query uses these:

Items: hospital (Q16917)
Properties: instance of (P31), coordinate location (P625)
(P625)

    
    
      #defaultView:Map
      SELECT * WHERE {
        ?item wdt:P31*/wdt:P279* wd:Q16917;
              wdt:P625 ?geo .
      }

~~~
mindcrime
You always need unique identifiers for things, that's pretty much core to the
whole Semantic Web idea. But what makes the Wikidata approach frustrating to
me is that they use these identifiers that are totally opaque. I can sort of
understand why they did it that way, but it does make for ugly queries.

------
tango12
SQL could actually be an API language without needing to be wrapped in
JSON/GraphQL. Building a GraphQL backend means dealing with a GraphQL AST:
parsing it, processing it and so on. GraphQL backends are hard to build
because naive implementations can't "optimise" the AST and end up making
individual data calls for each node in it. But considering the tooling around
GraphQL on the server and client side, it would have been amazing to have that
API tooling around SQL! With an SQL parser, hooks into the AST to run business
logic or add auth rules, and great client-side libraries like jooq [1] and
massivejs [2] for constructing SQL queries programmatically, SQL as an API
could have been a very feasible and neat idea!

The similarities between GraphQL and SQL are many, at least on the surface.
Another nice parallel that the blogpost doesn't mention is the idea of query
variables in GraphQL, which is reminiscent of prepared statements + variables
in Postgres.
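
For comparison, a Postgres prepared statement with a variable (a sketch;
table name invented):

    -- analogous to a GraphQL operation taking an $id variable
    prepare customer_by_id (int) as
      select * from customers where id = $1;

    execute customer_by_id(1);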

We believe in a similar thesis at Hasura[3], where the idea is to give app
developers GraphQL on SQL (Postgres) with minimal abstraction. However, we've
had to do a lot of work to support "application level" authorization, so that
read/write permissions are controlled through the application's auth system,
to make it a more complete solution.

In fact, our initial version was basically a JSON DSL very similar to
datasette, and we switched to GraphQL primarily because the client-side
tooling is quite amazing.

[1] [https://www.jooq.org/](https://www.jooq.org/)

[2] [https://github.com/dmfay/massive-js](https://github.com/dmfay/massive-js)

[3] [https://github.com/hasura/graphql-engine](https://github.com/hasura/graphql-engine)

------
egeozcan
I really want to use SQL _everywhere_. I'm sad that they deprecated WebSQL,
and I'm looking forward to the WebAssembly implementations (IIRC there was
already a project doing that).

~~~
mamcx
I'll up the challenge: use the relational model everywhere. I'm working on
it:

[http://tablam.org](http://tablam.org)

Eventually, if this project moves forward, I intend to provide a nicer version
of SQL on top of it, making it possible to do:

    from customers where id=1

instead of

    select * from customers where id=1

------
wenc
This sounds unintuitive, but the most lightweight API for analytics work I've
come across is a SQL view.

It provides one level of indirection from the underlying data structures (if
they change, you can just rewrite the view), and most DBs provide very good
access control. The view's schema is your interface contract, so users can
rely on its output.
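
For example (a sketch; table and column names are made up):

    -- consumers code against this shape, never against the raw tables
    create view daily_sales as
    select order_date, product_id, sum(amount) as total_amount
    from orders
    group by order_date, product_id;

    grant select on daily_sales to analyst;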

I was all about building REST interfaces to our backend databases until I
realized that SQL views were a simpler, more efficient abstraction for
analytics/machine learning work.

You may be like, "SQL views aren't REST APIs! It doesn't seem kosher!"

But if you think about it, REST APIs were designed for retrieving _small_
amounts of data quickly using simple parameters, which is useful for low-
latency web applications.

For analytics/machine learning work, we generally don't want that. We usually
want to retrieve _large_ chunks of pre-computed tabular data. And there's
already a language for that... SQL. And SQL databases are naturally fast for
large chunks of data (with optimizations like indexing, compression, etc. your
results can get even faster... plus no JSON serialization cost).

If you build a REST API to SQL datasets, you have to first (inefficiently)
parse the returned JSON, and reconstruct the dataframe for downstream
analytics/machine learning work. Why do that, when you can use SQL (with all
of its attendant data aggregation/manipulation keywords) to efficiently
retrieve data via a binary wire protocol (native drivers or ODBC/JDBC) into a
dataframe-like object?

~~~
Serow225
Can you please explain what the interface looks like a little more concretely?

~~~
wenc
It's exactly as it sounds: a SQL view on a database server. It manifests as a
SQL table backed by a query. It's mind-bendingly simple.

You can connect to it via a native db connector or ODBC. It's also accessible
to a wide range of tools including Excel, Tableau, or any number of
programming languages.

In Python you can pull data from it into a Pandas dataframe via TurbODBC
(which recently added Apache Arrow in-memory support). In R, you can connect
to it via dplyr. Apache Spark can get data from it via JDBC. You can
experiment with it via Jupyter Notebooks.

SQL support is ubiquitous.

------
tejasmanohar
Salesforce actually exposes a SQL variant called SOQL in their bulk export
API: [https://developer.salesforce.com/docs/atlas.en-us.soql_sosl....](https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql.htm).
It made it surprisingly easy for me to build an SF ETL tool a few years ago.

------
zrail
I've been thinking about this in terms of GDPR. Most companies will just give
a CSV dump to the user, but what if we exported it as a SQLite database
instead (or in addition)?

~~~
Radim
Ha! We do offer powerful "personal information analytics" in terms of GDPR,
[https://pii-tools.com](https://pii-tools.com) :)

It's more for auditors and DPOs though.

The challenge with GDPR is more about simplicity, taming the mess, rather
than adding more options. What sort of use-case scenarios do you have in
mind? What need would SQLite exports solve?

------
Someone
Reading the title, I expected something like

    
    
      update screen
      set red=3, green=200, blue=125
      where (row-100)*(row-100) + (col-200)*(col-200) < 12345
    

That could be a wonderful hack, certainly if combined with triggers that
generate or even react to screen updates. Carefully done joins could easily
blit in character bitmaps and icons.

If you have time at hand, feel free to implement this idea by adding a
storage engine that stores its data on the screen to your favorite SQL
database. Bonus points if you manage to run many of the updates on the GPU.
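
A trigger reacting to input might look like this (a sketch in SQLite syntax,
with an invented mouse_events table):

    -- paint a pixel red wherever a click lands
    create trigger on_click after insert on mouse_events
    begin
      update screen
      set red = 255, green = 0, blue = 0
      where row = new.row and col = new.col;
    end;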

------
domoritz
I actually implemented something like this a few years ago on top of Postgres
for CKAN, the data portal software. Check out the docs at
[https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckan...](https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_search_sql).

CKAN is used by data.gov, data.gov.uk and many other governments to publish
open data.

------
jillesvangurp
The problem with SQL is that most products that implement it have subtle (and
not so subtle) differences: custom functions, types, syntax for all sorts of
things, etc. SQLite is actually rather limited and lacks a lot of stuff you'd
find in e.g. Postgres or MySQL, both of which tend to implement the same
things very differently. That's of course aside from things like joins,
subselects, etc. ANSI SQL only gets you so far.

I'd argue decoupling how you distribute data from how you store data, and how
you are going to query it is probably a good thing.

CSV is indeed not a great format because it lacks structure and types
(everything is a string). Flattening complex data into CSV format can get ugly
quickly. I recently had to deal with some proprietary object database dumped
out as CSV, and had great fun trying to reverse engineer their schema from
the raw data.

XML has the same problem unless you dream up some schema that explicitly
states the type. JSON has some basic types for strings, booleans, and numbers,
which makes it a bit more useful than XML out of the box. But both XML and
JSON represent trees of information, while SQL databases represent tables of
rows. Of course some databases can store JSON (or even XML) and allow you to
do some queries over that. E.g. Elasticsearch does a decent job of guessing
schemas for JSON documents. It will figure out things like dates, geojson,
numbers, strings, etc.

So, distributing data as SQLite files is not necessarily a horrible idea, but
it doesn't really solve the problem of how to move data between different
data tools. What's lacking is a standard way to distribute strongly typed
tabular data. Most databases can import/export CSV and native inserts for
their flavor of SQL, but not much else. Most (open) data gets shipped using
some custom schema on top of XML, CSV, JSON or whatever else seemed
fashionable at the time.

------
ricardobeat
The author is missing a few important notions:

- a GraphQL server is storage-agnostic. It allows you to mix databases, APIs,
anything, into one cohesive queryable model. It acts as a hub for several
data sources; consumers don't need to care

- queries are easily composable, reusable, and can be aggregated for
batching, deduplication and optimisation (this can be done with a query
planner, but it is way more complex)

- it's safe enough for queries to be sent directly from the client on user
devices
------
truth_seeker
I am all for it. Also, a Postgres FDW for network (TCP/HTTP/UDP) sockets
would be great to further push the capabilities of PG.

------
emmelaich
See also

Programming in the URL string
[https://apenwarr.ca/log/20121218](https://apenwarr.ca/log/20121218)

Previously (not :-) discussed much at
[https://news.ycombinator.com/item?id=4939674](https://news.ycombinator.com/item?id=4939674)

------
lukaseder
> On security: the data is read-only, using SQLite’s immutable mode. You can’t
> damage it with a query—INSERT and UPDATEs will simply throw harmless errors.

In times of data breaches, the naivety of this statement is alarming. It is
not too difficult to extract (all) data from an API that allows for passing
arbitrary SQL statements.

Be careful here!

~~~
lukaseder
I mean this:

[https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67...](https://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4?sql=select+*+from+sqlite_master)
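
For anyone not clicking through, the query in that URL is simply:

    select * from sqlite_master

which lists every table and index in the file, complete with its schema.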

------
halfway
The best "API language" is the one constructed to work with the app's data
model. SQL constructs are relational - fine for rails, django, or salesforce
apps which map screen to table (or table join).

Wouldn't work for facebook, google knowledge graph, or the larger enterprise
app vendors today. Theirs is typed graph.

------
dustingetz
Don't use JSON, XML has been able to do that since 1996!

I actually agree with the premise – my startup is
[http://www.hyperfiddle.net/](http://www.hyperfiddle.net/) – but SQL is too
complicated and abstraction-resistant in a dozen different ways to be a
suitable foundation.

------
cryptonector
Use PostgREST (written in Haskell) and PostgreSQL. No SQL injection, but you
get a lot of the power of SQL anyways.

------
commandlinefan
In my experience, REST APIs devolve into thin wrappers over SQL statements
anyway. Might as well embrace it.

------
qwerty456127
> On performance: SQLite has a mechanism for canceling queries that take
> longer than a certain threshold.

Can't it consider the complexity of the query and the actual database
(indices, table sizes, etc.) to guess in advance how heavy the request is
going to be?
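
SQLite can at least report its plan up front, which a gateway could
presumably inspect before running anything (table name invented):

    explain query plan
    select * from events where user_id = 42;
    -- "SCAN events" means a full table scan (expensive);
    -- "SEARCH events USING INDEX ..." means an index lookup (cheap)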

~~~
arez
Sure, but aren't there some things that can't be calculated in advance, like
I/O, locked tables or priority queries, just to name a few?

~~~
qwerty456127
> Sure, but aren't there some things that can't be calculated in advance

I don't mean a precise calculation, just a reasonable guess.

> like i/o

It can be benchmarked and estimated.

> locked tables

Why would tables be locked if we work in read-only mode?

~~~
jondumbau
How did the data get there?

~~~
qwerty456127
The linked page says it uses immutable mode and that INSERTs and UPDATEs
simply raise errors. For me that's a very usual use case: I pre-populate a
database with a huge data set (some GiBs) in a single batch, then work with
the data (data science stuff) in read-only mode; some time later, I add
another batch of data (usually much smaller than the initial one), but
database writing and reading never happen simultaneously in this scenario. In
fact, Datasette seems like the thing I've always wanted, as it will probably
let me access my SQLite databases over a network this way.

------
voltagex_
[https://www.geopackage.org/](https://www.geopackage.org/) is worth a look too
- although it's more "SQLite as file format" than SQLite as API.

------
z3t4
Are there any better alternatives to SQL for requesting data!?

