
The Consortium for Python Data API Standards - BerislavLopac
https://data-apis.org/blog/announcing_the_consortium/
======
hogu
Just to set some context - this is being driven by Quansight which is led by
Travis Oliphant, who united Numeric and Numarray to create NumPy many years
ago, which led to the entire Python data science ecosystem as we know it. Yes
it's ambitious, and they may not succeed, but it's not their first barbecue.
Unifying communities and APIs is something they're good at.

~~~
hogu
For some context on why this matters, if you're writing a library (like
sklearn) and you want to support multiple array types, you might need to do
stuff like

    
    
      import numpy as np

      # one branch per array library you want to support
      if isinstance(x, np.ndarray):
          ...  # NumPy-specific code path
      elif isinstance(x, other_array):  # stand-in for a dask/cupy/torch array type
          ...  # a separate code path for that library
    

In the ideal case, having the standard means that scientific libraries can
support all conforming implementations by default. Then sklearn would
automatically support cupy/numpy/dask/jax/mxnet/pytorch/tensorflow arrays.
Multiply that by all the scientific libraries and the effect is pretty
profound.
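
Sketching what that could look like (hypothetical names; the actual mechanism
is exactly what the consortium still has to pin down), library code would be
written once against a standard namespace instead of once per array type:

    import numpy as np

    def get_array_namespace(x):
        # Illustrative stand-in: a real standard would define how a conforming
        # array advertises its namespace; here we just hand back numpy.
        return np

    def standardize(X):
        xp = get_array_namespace(X)
        # Written once against the (hypothetical) standard namespace, this
        # would work for any conforming array type, not just numpy.
        return (X - xp.mean(X, axis=0)) / xp.std(X, axis=0)

    print(standardize(np.array([[1.0, 2.0], [3.0, 5.0]])))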

------
simonw
The software they're running as part of this effort is really smart.

[https://github.com/data-apis/array-api-comparison](https://github.com/data-
apis/array-api-comparison) is a bunch of tooling to help illustrate the
differences between the different APIs.

[https://github.com/data-apis/python-record-api](https://github.com/data-
apis/python-record-api) is particularly interesting: it runs a GitHub Actions
workflow that uses the Python sys.settrace mechanism to track all calls
exercising those libraries and derive their type signatures, then generates
type files and commits them BACK to the repository. So
[https://github.com/data-apis/python-record-
api/tree/2259df2f...](https://github.com/data-apis/python-record-
api/tree/2259df2f876625abae7365931f203467906fd4ae/data/typing) has a whole
bunch of type information that's maintained by the bot.
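
A rough sketch of the settrace idea (not the actual python-record-api code;
the names here are illustrative, and it assumes numpy is installed): every
Python-level call is visible to the trace function, so you can record which
library functions a test suite exercises and with which argument names.

    import sys
    import numpy as np

    recorded = []

    def tracer(frame, event, arg):
        module = frame.f_globals.get("__name__", "")
        if event == "call" and module.startswith("numpy"):
            # record module, function name and the argument names it received
            recorded.append((module, frame.f_code.co_name, tuple(frame.f_locals)))
        return None  # no per-line tracing inside the call

    sys.settrace(tracer)
    try:
        np.mean(np.arange(10))  # stand-in for "run the downstream library's tests"
    finally:
        sys.settrace(None)

    print(recorded)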

------
BiteCode_dev
If arrays are standardized, we may even see a future where they end up being
in the stdlib, while libs like numpy, dask, etc. will mostly provide tooling
to work with them.

This would be a huge deal, allowing pure Python libs to suddenly exist for a
whole lot of things like image/video manipulation, advanced stats, and so on,
without requiring compiled extensions.

I love numpy, but having to install the 100 MB beast is overkill for a lot of
use cases. Not to mention restricted environments where you can't.

Well, one can dream, no?

Imagine having a numpy-like array as a primitive. Suddenly you can make the
case for a K/J-like DSL embedded in it, the way regexes are for text. Suddenly
the ceiling for "fast enough in pure Python" rises by an order of magnitude.
Suddenly you get a powerful shared buffer for multiprocessing.
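
On the last point, you can already get part of the way there by hand (a
minimal sketch using the stdlib shared_memory module from Python 3.8 plus
numpy); a built-in array primitive could make this the default path instead
of manual plumbing:

    from multiprocessing import shared_memory
    import numpy as np

    # Allocate a shared block and view it as an array of 1024 float64 values.
    shm = shared_memory.SharedMemory(create=True, size=1024 * 8)
    a = np.ndarray((1024,), dtype=np.float64, buffer=shm.buf)
    a[:] = 0.0  # visible to any other process that attaches via shm.name

    # ... other processes: shared_memory.SharedMemory(name=shm.name) ...

    shm.close()
    shm.unlink()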

~~~
BerislavLopac
A good first step in that direction would be writing a solid PEP. But I'd say
that it would be helpful to have a common standard ready first.

------
rymurr
I suppose this can be compared to standardising SQL. However that has ended up
with a least common denominator of simple operations and dozens of specialised
dialects. Better than before the standard, but certainly not great, and a ton
of engineering has gone into cross-dialect facades.

Granted I haven't thought as long or as hard as the authors but I wonder if
there are easier ways to go about this:

a) something akin to slf4j or one of the many unifying facades in Java

b) standardising the data structures/formats, e.g. Arrow has done quite a good
job positioning itself as a data lingua franca and has compute primitives (see
the sketch below).
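
To illustrate the Arrow point (a minimal sketch, assuming pyarrow is
installed): the same columnar data can be handed between runtimes, and pyarrow
ships compute kernels of its own.

    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array([1, 2, 3, 4])   # a columnar array in Arrow's shared format
    print(pc.mean(arr))            # an Arrow compute primitive, no numpy needed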

Regardless, I will be watching with interest!

------
wjn0
I think this might be a bit premature. For the bulk of these tools
(pandas/numpy/tensorflow/torch), most of their users have probably been using
them for 3-5 years max.

A universal API standard is a nice idea, but I don't know that the ecosystem
has been around long enough to justify this effort.

Given the asynchronous nature of consortiums, though, maybe this is exactly the
right time to start talking about this.

Will be watching closely either way.

~~~
travisoliphant
This is about creating a standard for array computing in Python which has been
around for more than 20 years. NumPy (which has been a de facto standard) has
been around for 14 years and was based on Numeric which was around for 10
years before that. Pandas has been around for almost 10 years now. The parts
of the standard that are clear, and that will emerge, are the parts that have
been around and used by a sufficient number of people for at least a decade.

------
mlthoughts2018
This actually strikes me as a huuuge waste of effort. I work with every one of
the different technologies they mention every day.

The differences and idiosyncrasies are truly, truly not a big deal and really
what you want is to allow different library maintainers to do it all
differently and just build your own adapters on top of them.

This allows library developers to worry about narrow use cases and have their
own separate processes to introduce breaking changes and features that deal
with extreme specifics like GPU or TPU interop, heterogeneous distributed
backed arrays, jagged arrays, etc. etc.

Let a thousand flowers blossom and a thousand different Python array and
record APIs contend.

End users can write their own adapters to smooth over irregularities between
them, possibly writing different adapters on a case-by-case basis.

If any “standard” adapters gain popularity as open source projects, great -
but don’t try to bake that in from an RFC point of view into the array & data
structure libraries themselves. Let them be free / whatever they want to be.
That diversity of API approach is super valuable, and the irregularities are
easily smoothed over by your own custom adapter logic.

~~~
travisoliphant
Bias warning --- I'm part of the group doing this work. For an end-user, your
arguments are fine. A single developer or team can indeed use whatever tool
they want and adapt their code without too much difficulty.

The real challenge comes, however, as you try to build an ecosystem of
libraries on top of multiple different APIs. You end up reinventing the wheel
or struggling to maintain multiple versions of the same library.

Some cooperation is immensely helpful at the lowest level, and we hope
ultimately to be helpful to the different tool builders in providing
information and guidance about what is already a standard and agreed upon in
the space.

This is not about establishing or forcing behavior. It is about documenting
and clarifying best practices that have already emerged as well as potentially
giving libraries a way to signal to downstream developers of additional
libraries that they adhere to the standard.

~~~
mlthoughts2018
I don’t really disagree with most of your points, but what you describe does
not sound like it should be a committee with governance rules and member
libraries tied to an RFC process.

What you describe sounds like you could just create your own single library
like a Swiss army knife that contains adapters to rationalize the semantics if
you want, even at the ABI level if truly needed.

Member libraries can make a best-effort attempt not to break that adapter
library’s way of wrapping them, and/or contribute patches that keep the
adapter in sync with new releases of the library.

This makes the adapter layer become totally opt-in and backported rather than
required via consortium membership & RFC governance, and even leaves room for
competitor adapter tools to coexist that might do things differently.

Ultimately my concern with a governance model is that it creates political
power to compel the member libraries to do things that might go against the
needs of their users for the sake of the consortium.

I just don’t see a reason why that’s more valuable than letting all the
libraries do their own thing and let users choose or make their own custom
adapters.

This example of array APIs is nothing like, say, the HTTP protocol or the IEEE
floating-point standard, which are cases where a governing standard makes
sense. This is nothing like that, in my opinion.

------
InfiniteRand
Given the algorithmic focus of the libraries being targeted, I think the main
benefit is if I could do X operations with library A and then use the result
to do Y operations with library B. It does seem like that is the intent (if I
am not mistaken), so this seems like a worthwhile effort.
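
Today that kind of hand-off needs library-specific glue at each step,
something like this (an illustrative sketch, assuming numpy and pytorch are
installed):

    import numpy as np
    import torch

    a = np.linspace(0.0, 1.0, 5)   # do X with library A (numpy)
    t = torch.from_numpy(a)        # explicit, torch-specific hand-off
    t = torch.sqrt(t)              # do Y with library B (pytorch)
    back = t.numpy()               # and another explicit call to come back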

One problem with libraries in general is that they work fine until they don't,
and once you've been working in one library for a while, all of your support
functions and transformed data are based around that library's API, which
makes integrating a new one difficult. This could help with that type of
difficulty.

------
xtiansimon
I am so confused. Maybe it's because I'm a part-time self-taught programmer.
Are we _calling_ function signatures 'APIs' just so we can speak about the
topic of so much diversity in what on the face of it should all be the same
implementation? Because my idea of API is the endpoint for a closed system,
usually accessed over the wire.

Is this use of the term 'API' canonical or artistic license?

~~~
geertj
APIs are Application Programming Interfaces, and they have existed long before
the web and web services. The article's use of the term is correct. A really
well-known example of an API is the C standard library.

~~~
xtiansimon
How is this discussed in computing textbooks? Is there a chapter I can read to
catch up?

------
Tarq0n
I wonder if this risks ossifying APIs at the lowest common denominator. Both
the numpy and pandas APIs are far from ideal; see for instance the recent
developments towards named tensors.
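
For anyone who hasn't seen them, this is roughly what named tensors look like
(a small sketch of PyTorch's experimental support; not part of any proposed
standard):

    import torch

    # Dimensions carry names, so code can say "reduce over channel" instead of
    # hard-coding a positional axis number.
    t = torch.zeros(2, 3, names=("batch", "channel"))
    print(t.names)                 # ('batch', 'channel')
    print(t.sum("channel").shape)  # torch.Size([2])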

------
pyuser583
Kind of ironic that none of these data structures are in the standard library.

Yes, the standard library contains NumPy-like arrays, but nobody uses them.
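
Presumably that refers to the stdlib array module: typed, compact sequences of
numbers, but with none of NumPy's vectorised operations or multi-dimensional
support.

    from array import array

    a = array("d", [1.0, 2.0, 3.0])  # a typed array of C doubles
    a.append(4.0)
    print(a[2] * 2.0)  # element access works fine...
    print(a * 2)       # ...but this repeats the sequence rather than broadcasting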

------
mikkelam
Sort of relevant [https://xkcd.com/927/](https://xkcd.com/927/) ;)

All joking aside, I applaud the effort and would love for the project to
succeed. I can never remember that tensorflow calls it reduce_mean.
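
That's exactly the flavour of divergence a standard could remove: the same
reduction is spelled differently in each library (a quick sketch, assuming
numpy, pytorch and tensorflow are all installed):

    import numpy as np
    import torch
    import tensorflow as tf

    np.mean(np.arange(4.0))        # numpy
    torch.mean(torch.arange(4.0))  # pytorch
    tf.reduce_mean(tf.range(4.0))  # tensorflow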

------
simonw
My Datasette project has given me a slightly different perspective on API
standards - at least in terms of things like shared definitions of XML, JSON
shapes, RDF schemas etc.

Once you've loaded your data into a SQLite database and put Datasette in front
of it, your audience can get the data out in any shape that they like.

Consider the museum data in my [https://www.niche-
museums.com/](https://www.niche-museums.com/) website.

Using a SQL query, I can get it out as an array of objects with name, url,
address, latitude, longitude: [https://www.niche-
museums.com/browse.json?sql=select+name%2C...](https://www.niche-
museums.com/browse.json?sql=select+name%2C+url%2C+address%2C+latitude%2C+longitude+from+museums+limit+10&_shape=array)

Or maybe I want to get back the name as "title" and the latitude longitude as
lat / lng. I can do that with a different SQL query:

    
    
        select name as title, latitude as lat, longitude as lng from museums limit 10
    

Here's that as JSON: [https://www.niche-
museums.com/browse.json?sql=select+name+as...](https://www.niche-
museums.com/browse.json?sql=select+name+as+title%2C+latitude+as+lat%2C+longitude+as+lng+from+museums+limit+10&_shape=array)

Let's say I want it as an Atom feed. Datasette plugins can add different
output renderers, so I can get back an Atom feed like this:

[https://www.niche-
museums.com/browse.atom?sql=SELECT%0D%0A++...](https://www.niche-
museums.com/browse.atom?sql=SELECT%0D%0A++%27tag%3Aniche-
museums.com%2C%27+%7C%7C+substr\(m.created%2C+0%2C+11\)+%7C%7C+%27%3A%27+%7C%7C+m.id+as+atom_id%2C%0D%0A++m.name+as+atom_title%2C%0D%0A++m.created+as+atom_updated%2C%0D%0A++%27https%3A%2F%2Fwww.niche-
museums.com%2Fbrowse%2Fmuseums%2F%27+%7C%7C+m.id+as+atom_link%2C%0D%0A++coalesce\(%0D%0A++++%27%3Cimg+src%3D%22%27+%7C%7C+m.photo_url+%7C%7C+%27%3Fw%3D800%26amp%3Bh%3D400%26amp%3Bfit%3Dcrop%26amp%3Bauto%3Dcompress%22%3E%27%2C%0D%0A++++%27%27%0D%0A++\)+%7C%7C+render_markdown\(%0D%0A++++m.description+%7C%7C+%27%0D%0A%0D%0A%27+%7C%7C+coalesce\(%0D%0A++++++\(%0D%0A++++++++select%0D%0A++++++++++group_concat\(%0D%0A++++++++++++%27*+%5B%27+%7C%7C+json_extract\(p.value%2C+%27%24.title%27\)+%7C%7C+%27%5D\(%27+%7C%7C+json_extract\(p.value%2C+%27%24.url%27\)+%7C%7C+%27\)+%27+%7C%7C+json_extract\(p.value%2C+%27%24.author%27\)+%7C%7C+%27%2C+%27+%7C%7C+json_extract\(p.value%2C+%27%24.publication%27\)+%7C%7C+%27%2C+%27+%7C%7C+json_extract\(p.value%2C+%27%24.date%27\)%2C%0D%0A++++++++++++%27%0D%0A%27%0D%0A++++++++++\)%0D%0A++++++++from%0D%0A++++++++++json_each\(coalesce\(m.press%2C+%27%5B%7B%7D%5D%27\)\)+as+p%0D%0A++++++\)%2C%0D%0A++++++%27%27%0D%0A++++\)%0D%0A++\)+%7C%7C+coalesce\(%0D%0A++++\(%0D%0A++++++select%0D%0A++++++++group_concat\(%0D%0A++++++++++%27%3Cp%3E%3Cimg+src%3D%22%27+%7C%7C+json_extract\(ph.value%2C+%27%24.url%27\)+%7C%7C+%27%3Fw%3D400%26auto%3Dcompress%22%3E%3C%2Fp%3E%27%2C%0D%0A++++++++++%27%27%0D%0A++++++++\)%0D%0A++++++from%0D%0A++++++++json_each\(coalesce\(m.photos%2C+%27%5B%7B%7D%5D%27\)\)+as+ph%0D%0A++++\)%2C%0D%0A++++%27%27%0D%0A++\)+as+atom_content_html%2C%0D%0A++%27Simon+Willison%27+as+atom_author_name%2C%0D%0A++%27https%3A%2F%2Fsimonwillison.net%2F%27+as+atom_author_uri%0D%0AFROM%0D%0A++museums+m%0D%0Aorder+by%0D%0A++m.created+desc%0D%0Alimit%0D%0A++15)

That's using a convoluted SQL query to define exactly what I want back in that
feed.

The point is: if you have a sufficiently flexible API gateway in front of your
data, the people consuming your data can define the "standard" that they want
to retrieve it in at runtime.

I think that's a really interesting characteristic. It's made me less excited
about boil-the-ocean attempts at defining a single standard and getting
everyone to support it at once.

[ This thread may not be the best place to raise this idea, since the Python
Data API Standards project isn't really in the same realm as the web API
formats I'm talking about here ]

~~~
travisoliphant
I agree that interfaces like Datasette are a fantastic idea. But it's not the
same thing. Think of this more like the SQL standard. What would you do if SQL
did not exist and there were no common ways to discuss querying databases?

~~~
simonw
100% agree, this isn't the right thread for my random thoughts on web APIs!

~~~
sradman
I don’t think your ideas are random or out of place. Durable data accessible
by different clients is an important property that bleeds into the data
science realm. Apache Arrow seems to partially tackle this problem by moving
the data to shared memory that can be accessed by different runtimes like
Python and R.

There are a couple of problems I see with SQLite used as a common data format:

1. SQLite is a state machine that only stores the most recent version of
mutable (binary) tables, so change control is non-trivial.

2. Matrices/tensors are not naturally stored in relational tables.

