
The Splitgraph Data Delivery Network – query over 40k public datasets - mildbyte
https://www.splitgraph.com/blog/data-delivery-network-launch
======
emersion
>The single SQL endpoint is well suited for a data marketplace. Data vendors
currently ship data in CSV files or other ad-hoc formats. They have to
maintain pages of instructions on ingesting this data. With Splitgraph, data
consumers will be able to acquire and interact with data directly from their
applications and clients.

I appreciate the effort to make it easier for users to access heterogeneous
data sets, but I really hope data vendors keep shipping raw CSV files. I don't
want a company to gate access to the data, merely offering a proxy. I want to
be able to download the whole raw datasets from the vendor directly if I want
to.

~~~
cbetti
In my view our collective interest in CSV as a medium for data distribution
has resulted in far too much information loss, and consequently, time wasted
on input sanitization, validity checking, and unresolvable conversations about
the intent of data values like "1.12345E+11" and "".

~~~
paulgb
Having spent way too much time wrangling vendor-provided CSVs, I 100% agree.
I'd love for there to be a common, well-understood format for typed tabular
data that supports multiple tables and enforces foreign keys between them.
Ideally with a concept of "patching" to enable incremental updates.

Probably the closest thing I'm aware of is handing around a sqlite file, but
I'm a little uneasy using a format that's meant to be a database as a transfer
format. Dolt looks promising here too. Are there other ways?
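
To make the sqlite idea concrete, here's a rough sketch (file and table names
are made up) of shipping a two-table dataset with an enforced foreign key as a
single .db file - note the easy-to-forget foreign_keys pragma:

    import sqlite3

    # A two-table dataset with an enforced foreign key, shipped as one file.
    con = sqlite3.connect("dataset.db")
    con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite!
    con.execute("CREATE TABLE vendors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    con.execute("""CREATE TABLE prices (
        vendor_id INTEGER NOT NULL REFERENCES vendors(id),
        price REAL NOT NULL)""")
    con.execute("INSERT INTO vendors VALUES (1, 'acme')")
    con.execute("INSERT INTO prices VALUES (1, 9.99)")
    con.commit()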

~~~
Fiahil
At work, we use Parquet
([https://parquet.apache.org/](https://parquet.apache.org/)) for almost
everything related to a dataframe. We don't really care about performance
gains (although, it's nice to have), but we really like to have a schema.

Note: we mostly use Python, some R, and a varied range of ML or optimisation
tools, depending on the project.
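
To make that concrete, a minimal pyarrow sketch (column and file names are
made up): the schema travels with the file, and the columnar layout lets a
reader pull out only the columns it needs.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # An explicit schema: integer and UTF-8 columns survive the round trip.
    schema = pa.schema([("sensor_id", pa.int64()), ("label", pa.string())])
    table = pa.table({"sensor_id": [1, 2, 3], "label": ["a", "b", "c"]},
                     schema=schema)
    pq.write_table(table, "readings.parquet")

    # Read back just one column instead of scanning the whole file.
    ids = pq.read_table("readings.parquet", columns=["sensor_id"])
    print(ids.schema)  # sensor_id: int64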

~~~
xyzzy_plugh
Parquet is kind of a royal pain in the ass compared to CSV/JSON/plaintext,
mostly because it uses a ton of Thrift encodings, resulting in mostly
terrible/broken implementations anywhere outside the Java/JVM ecosystem. If
you're running Apache <Whatever> then sure, it'll probably be fine, but I'd
recommend avoiding it if you start having to go down the rabbit hole of
implementing support for it in your language du jour.

~~~
Fiahil
The Rust and Python implementations are fine. But I get it: Parquet may not
be perfect or optimal or whatever. It works as a simple, typed, columnar
format.

We had to pick a single file-format recommendation for sending 100GB+ tables
over FTP servers or Dropbox, scanning terabytes of useless stuff only to grab
a key-value pair, and properly reading integer and UTF-8 columns. Turns out,
Parquet is practical - enough for users to start using it instead of CSV. It
could be Avro, but it's just not as easy.

~~~
xyzzy_plugh
> But I get it, Parquet may not be perfect or optimal or whatever.

I actually think Parquet is pretty great in practice; I just have some issues
with the sheer volume of abstractions necessary to implement it. I wish it
were anything other than Thrift.

I would probably choose Parquet over anything else, though.

------
lifeisstillgood
I see a new job title coming into being - Enterprise Data Librarian.

40,000 data sets - even if many are just diff versions - is a ridiculous
number to manage, or even know about, on a non-full-time basis.

Data driven decisions need data yes, but they also need people to know the
data exists. And what it means.

And this is just external curated data - use this as the standard for what
each department should be producing internally.

In fact that's a good idea - a data publishing standard - not just the data
types / schema, but actually supplying it through a format that is consumable
by others.

~~~
bmh100
This job already exists, as either Business Intelligence Analyst or Business
Intelligence Developer. The professions have existed for decades.

------
Fiahil
As someone who has tried, and almost succeeded, to get rid of Pachyderm over
the last two years, I like what I just read.

Something is not entirely clear to me right now: an image is an immutable
snapshot of a dataset at a given point in time - great - but can I query the
same dataset at two different points in time using layered querying in SQL?
Something like this: SELECT * FROM dataset:version-1, dataset:version-2

Also, are you storing each version as a full new dataset, or only the diff
between versions (later reconstructing the full image)?

Now, onto the things that could be improved...

- Git-like semantics (pull, push, checkout, commit) are poorly suited for
versioned, immutable datasets. Just (intelligently) abstract fetching and
sending datasets by looking at the SQL query (dataset:version-2, above).

- Versions should be at least partially ordered and monotonically increasing.
Hashes don't convey the information necessary to decide whether dataset:de4d
is an earlier version of dataset:123a or not.

- Tracing a derived dataset's provenance will only work if you can assert
that the "code" or transformations applied to the original dataset are
deterministic (side-effect free). So either you build your own ETL language
that you can execute in a sandbox and add a myriad of useless stuff for
creating and scheduling pipelines (please don't do that!), or you just let it
go and don't end up becoming Pachyderm (sounds great!).

~~~
didip
What's wrong with Pachyderm? Sounds like you are having a lot of pain with
them.

I am not affiliated at all with Pachyderm, I am just curious about this
problem-space.

~~~
Fiahil
There are a good number of specific reasons, but it really comes down to
this: they have too many features.

------
eadan
This is very cool. Relatedly, as a data scientist, I wish companies would
expose their APIs through SQL. I've spent a lot of time pulling data into ETL
jobs from things like Mixpanel, AdWords, etc., and having a unified interface
would make things much simpler.

I'm trying to understand the architecture of Splitgraph. Are all foreign data
wrappers controlled directly by you, or can third parties host a database and
connect it to Splitgraph in a federation?

~~~
mildbyte
Currently we control and set up all the FDWs (well, an orchestration layer
does it on the fly as the query comes in and routes the query to the correct
schema with foreign tables).

You can also run a Splitgraph engine locally and add your own FDWs to it. We
have a lot of scaffolding around FDWs to make instantiating them much simpler,
and we even wrote a blog post [0] about adding a custom FDW to Splitgraph.

However, in the future we'll be adding the ability to add your own backend
data sources to Splitgraph that it can proxy to (whether as a private dataset
on the public Splitgraph instance or as a "data virtualization" layer when you
have an in-house Splitgraph deployment).

The cool thing is that this can be a single gateway to all your data silos
(Snowflake, third-party SaaS, public datasets) that can handle federated query
execution, data discovery and access control (e.g. firewalling queries to
sensitive columns even if the backend data source doesn't support this level
of granularity).

[0] [https://www.splitgraph.com/blog/foreign-data-wrappers](https://www.splitgraph.com/blog/foreign-data-wrappers)

------
big-malloc
Is there a CPU limit or timeout for queries? I’d be a little concerned that
an intentionally slow and inefficient query could pin the CPU at 100% and
ruin performance for other users.

~~~
mildbyte
We currently limit all queries to 30s of execution and 10,000 rows returned
(by adding a `LIMIT` clause to queries that don't have one). We also have some
mechanisms like query result caching and rate limiting for better QoS. One of
our directions is building what is basically a CDN for databases, so it's good
to figure these things out as early as possible.
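
For context, stock Postgres can enforce a similar guardrail per session; a
quick psycopg2 illustration (placeholder credentials, and not necessarily how
we implement it on our side):

    import psycopg2

    # statement_timeout makes the server cancel any statement that runs
    # longer than the given number of milliseconds.
    conn = psycopg2.connect(
        host="data.splitgraph.com", port=5432, dbname="ddn",
        user="<username>", password="<api-key>",  # hypothetical placeholders
        options="-c statement_timeout=30000",
    )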

~~~
big-malloc
Interesting! I’ll be keeping an eye on this project, I love postgres

------
codetrotter
> postgresql://data.splitgraph.com:5432/ddn

That’s actually pretty cool, to see a public URL with a PostgreSQL protocol
signifier like that.

Makes me wonder if any developers or DB Architects ever thought of putting
their resume in a DB and putting a public read-only postgresql:// URL on their
business card :D

~~~
wdfx
I would recommend you don't do that. Nobody who is reviewing candidates has
time to write a client to get your raw information out of a database.

If _you_ build the client and a web page/app, document how you did it, and
show us the code, that'd be much more useful.

~~~
rad_gruchalski
I'm sorry, what. If you can't query that, why should I work for you? ;)

~~~
_jal
I wouldn't expect much from going this far, but yes: too many HMs either
forget, or never realized, that they're being interviewed, too.

Saw this more before I started refusing interviews with the HugeCos, but the
strangest flavor of this I've run into was at an early startup - I suspect the
dude was acting out an arrogant-genius script, hoping to convince people he
was one. (It came out as very weird, not in a good way.)

------
jarym
Very neat indeed. I thought Postgres had a max identifier length of 63
characters, so I was surprised to see
_"cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations_
in the FROM part of the statement. Does the max identifier length not apply
for some reason here, or has Splitgraph done something to increase it?

On a related note, I've long wanted longer identifier lengths in Postgres so
we can have more meaningful column names but the powers-that-be have always
refused... hopefully one day it'll increase in the default distribution.

~~~
mildbyte
Co-founder here. The 63-char limit still applies (we didn't recompile
Postgres!), but we have some code in front, embedded in a layer of PgBouncers,
that intercepts the query, parses it, and rewrites the long name into a
shorter dataset ID hash that we then "mount" on the database on the fly using
Postgres FDWs before forwarding the query.

We also use this to drop unwanted queries and rewrite clients' introspection
queries (e.g. information_schema.tables) to give them a list of featured
datasets instead of normal Postgres schema names.

~~~
marton78
Nice! And how do you parse SQL? Do you use Postgres's parser or did you have
to implement one yourself?

~~~
mildbyte
We use pglast [0]: it's basically a Python wrapper around Postgres's query
parsing code.

[0] [https://github.com/lelit/pglast](https://github.com/lelit/pglast)
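
For a flavour of the rewriting described upthread, here's a toy sketch (not
our production code) that uses pglast to append a LIMIT to a SELECT that
lacks one:

    from pglast import parse_sql

    def enforce_limit(query: str, max_rows: int = 10000) -> str:
        stmt = parse_sql(query)[0].stmt   # raises pglast.Error on invalid SQL
        # SelectStmt carries a limitCount node only when a LIMIT is present
        if type(stmt).__name__ == "SelectStmt" and stmt.limitCount is None:
            return f"{query.rstrip().rstrip(';')} LIMIT {max_rows}"
        return query

    print(enforce_limit("SELECT * FROM big_table"))
    # SELECT * FROM big_table LIMIT 10000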

------
touisteur
I think the reason FDWs aren't used more is that they're not easy to get into.

The best link/example I found was
[https://github.com/beargiles/passwd-fdw](https://github.com/beargiles/passwd-fdw)
and it's quite easy to follow the code and understand all the moving parts.

Once you've written an FDW you'll see them everywhere.

In fact the same author wrote
[https://github.com/beargiles/zipfile-fdw](https://github.com/beargiles/zipfile-fdw)
and
[https://github.com/beargiles/tarfile-fdw](https://github.com/beargiles/tarfile-fdw).

If you (still?) need inspiration and want to see what already exists:
[https://wiki.postgresql.org/wiki/Foreign_data_wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)

Simpler, using a 'generic' file-based FDW shipping with pg's sources:
[https://aaronparecki.com/2015/02/19/8/monitoring-cpu-memory-usage-from-postgres](https://aaronparecki.com/2015/02/19/8/monitoring-cpu-memory-usage-from-postgres)

There's a Python wrapper to get your feet wet (or prototype an idea):
[https://github.com/Segfault-Inc/Multicorn](https://github.com/Segfault-Inc/Multicorn)
(though I'm not sure how well maintained it is).
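
To show how little is needed to get started, here's a minimal Multicorn FDW
sketch (module and table names are made up, and it assumes the multicorn
extension is installed):

    from multicorn import ForeignDataWrapper

    class HelloFDW(ForeignDataWrapper):
        def __init__(self, options, columns):
            super().__init__(options, columns)
            self.columns = columns

        def execute(self, quals, columns):
            # quals carry pushed-down WHERE conditions; a real FDW can use
            # them to fetch less data from the remote source.
            for i in range(3):
                yield {"id": i, "greeting": f"hello {i}"}

    # Registered from SQL roughly like this:
    #   CREATE EXTENSION multicorn;
    #   CREATE SERVER hello_srv FOREIGN DATA WRAPPER multicorn
    #     OPTIONS (wrapper 'mymodule.HelloFDW');
    #   CREATE FOREIGN TABLE hello (id int, greeting text) SERVER hello_srv;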

The only annoying part is that you're plugging your code into an interface
that might break (and sometimes has) between releases of PG. So, kind of the
same fun as maintaining a gcc plugin...

By the way, does anyone have any idea about the licensing terms/issues of PG
FDWs and PG extensions in general?

~~~
chatmasta
FDWs are a powerful feature. We’ve done a lot of work to make scaffolding them
easier, if you use sgr. One of our earlier blog posts includes an example of
making an FDW for the HN API and packaging it as a Splitgraph mount handler.
[0]

There’s also this great post about using FDWs for parallel query execution, by
David Grier at Swarm64. [1] They seem to be doing cool things with FDWs too.

Personally I think proxies are a really powerful abstraction in general.
Cloudflare and Twilio are two examples of companies built around proxies.

[0] [https://www.splitgraph.com/blog/foreign-data-wrappers](https://www.splitgraph.com/blog/foreign-data-wrappers)

[1] [https://swarm64.com/post/scaling-elastic-postgres-cluster/](https://swarm64.com/post/scaling-elastic-postgres-cluster/)

~~~
touisteur
Thanks for both links!

I'm not sure where you're going with this product, but I like the idea of
proxies and your idea of a DDN, and I wish you the best.

I'm trying to 'bind them all' at my job right now, and Postgres is very
inspiring with its extension success stories and all the FDW work happening.
And once you can bind to C, you can mostly choose your language :-)

------
dumbfounder
OK, I signed up and used your recommended client (DBeaver 7.1.5), but I don't
see the schemas shown in your picture.

[https://ibb.co/gwLfHVz](https://ibb.co/gwLfHVz)

~~~
chatmasta
Sorry about that! I just tested with the same DBeaver version and was unable
to reproduce the bug. If you email support@splitgraph.com with your username,
we can check the logs and figure out the problem for you.

In general, some clients will have issues with introspection queries (which
they use to show the list of available schemas). And even when introspection
works, we can't show you all 40k datasets or it might break your client. So we
just show "featured" datasets, and for a more exhaustive list you can go to
the website to find a dataset.

But, you can still query any dataset on Splitgraph regardless of what shows in
your client's schema list. You can go to
[https://www.splitgraph.com/explore](https://www.splitgraph.com/explore) to
find datasets, and from each dataset's repo page, you can click "tables" or
"query SQL" to get a sample SQL query to run.

~~~
thibautg
Same issue with DBeaver 7.1.5 on Windows 10. I've sent an email to
support@splitgraph.com.

Congrats on the product BTW. Very promising!

I've been looking forward to something like this for years.

~~~
thibautg
It's fixed now, thank you!

------
varelaz
Thanks for opening up a new way to work with public data and discover it. I
have several ideas regarding this. I've used free public APIs, and the worst
thing about them is that they're all unreliable - unreliable in their terms
and limits - and they usually don't scale. And you cannot blame the API
providers, because you don't pay for them. I vote for premium, resource-based
access to the data with a free tier: pay for the level of service you need,
or use the tiny, limited free access.

------
geordee
SQL is the API for data.

~~~
jarym
'Data' is quite broad. If you need results in a tabular format, I agree with
you - without doubt, SQL is the API.

But for nested data (XML, JSON, etc.), it really isn't the best language. I'm
talking here specifically not about querying data that is nested, but about
actually getting query results in a nested format. SQL can do it (almost all
major databases have XML and JSON support), but it really isn't the easiest
thing to use.

~~~
geordee
Right. As the name implies, it's for structured data. Another way of looking
at it: if we make the effort to structure the data, SQL is pretty good.

We should also consider the capabilities of PostgreSQL to query JSON
documents.
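
As a quick illustration of that last point (authors and posts are
hypothetical tables), json_build_object and json_agg can return one nested
document per row:

    import psycopg2

    NESTED = """
        SELECT json_build_object(
                 'author', a.name,
                 'posts',  json_agg(json_build_object('title', p.title))
               )
        FROM   authors a
        JOIN   posts   p ON p.author_id = a.id
        GROUP  BY a.name
    """
    with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
        cur.execute(NESTED)
        for (doc,) in cur.fetchall():
            print(doc)  # one nested JSON document per author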

------
dsr_
I can find a privacy policy. It's not awful.

I can't find pricing.

~~~
chatmasta
Hey, co-founder here. There's no pricing yet as we've just launched. Our plan
is basically a public/private instance model, kind of like GitHub. The public
instance will eventually have quotas (definitely for storage, maybe for
server-side query execution). But the main product will be private deployments
of Splitgraph, for companies that want a catalog for their internal data.
Eventually you'll be able to use the web UI to connect arbitrary upstream
sources (Snowflake, BigQuery, SaaS, etc.) to the catalog. You'll be able to
manage services on top of each source (e.g. caching, snapshotting, access
control, query rewriting, firewalling) and share data with colleagues from the
web UI. Basically, we can provide a sort of aggregation layer on top of your
disparate data sources. We think pairing the proxy with a catalog is really
powerful.

We're actually looking for early customers to beta test the private
deployment, more on that here:
[https://www.splitgraph.com/about/company/private-cloud-beta](https://www.splitgraph.com/about/company/private-cloud-beta)

p.s. Glad you like our privacy policy at least. :) We hate third party
trackers.

------
sradman
Does the Splitgraph Data Delivery Network allow queries that ORDER BY an
unsorted column? This seems like a vector for a Denial of Service attack.

------
dariosalvi78
Mm, interesting... we have this open Postgres instance (read-only) for
covid19 research: [https://covid19.eng.ox.ac.uk/](https://covid19.eng.ox.ac.uk/)

We have it running on our own (cheap) server, but we fear we may get
overwhelmed by too much traffic if the project becomes very successful. Would
this be a solution for us? Is it free?

~~~
chatmasta
Very cool! This is a great use case for Splitgraph. We'd be happy to help you
deliver that data. The easiest method would probably be for us to proxy
queries to your Postgres instance (you can't do this yourself from the website
yet, but it's a planned feature, and we can work with you to set it up), and
then you could benefit from our connection pooling and caching. Another option
would be for you to push the data to Splitgraph as an image (to keep it up to
date, you can set up a local Splitgraph instance as a PG replication client
and periodically `sgr commit` a new image [0]). If you'd like to chat details,
feel free to email support@splitgraph.com or join the Discord
([https://discord.gg/eFEFRKm](https://discord.gg/eFEFRKm)).

In terms of price, we'll eventually add quotas (storage + server side query
execution) on the public tier. But the main monetization will be private
deployments. In an ideal world, the private deployments will be able to
subsidize the costs of some of the open data on the platform. Certainly we'd
like to be able to support projects like this one.

[0]
[https://github.com/splitgraph/splitgraph/tree/master/examples/pg-replication](https://github.com/splitgraph/splitgraph/tree/master/examples/pg-replication)

~~~
dariosalvi78
Excellent, thanks a lot for this! I will bring your suggestions to the rest of
the team and come back to you via Discord.

------
georgewfraser
Postgres foreign data wrappers are a weird choice of engine. Most queries to
this service will be scans, in which case a column-oriented, vectorized,
massively parallel engine like Presto will be 1000 times faster or so.
Postgres’ underlying engine is optimized for scenarios where you read a small
number of rows using an index.

~~~
mildbyte
Hey George, thanks for the comment and for the good points!

We want to initially focus on the use case with lots of diverse backend
datasets and ad-hoc APIs (maybe with a no-code-like solution on top of a
generic FDW), where performance won't be the bottleneck. If necessary, the
backend data sources can perform aggregation and fast query execution. For
example, you can also put Splitgraph in front of Presto (through JDBC). The
value we want to provide in these cases is:

* granular access control (e.g. masking for PII columns, auditing etc)

* firewalling/query rewriting/rate limiting (for publicly accessible endpoints that proxy to internal databases that vendors want to publish more easily than through cronjobs with data dumps)

* cataloguing (so you get to discover datasets/data silos, get their metadata and query it over multiple interfaces in the same product)

We also like keeping the PG wire format in any case, as there are so many BI
tools and clients that use it that it makes sense to not break that
abstraction. We started with PG FDWs just because of the simplicity and the
availability of FDWs, but we might swap the actual Postgres FDW layer for some
faster execution in the future, if it's needed.

Behind the scenes, for Splitgraph images, we use cstore_fdw as an intermediate
storage format (it's a columnar store similar to ORC with support for all PG
types like PostGIS geodata). There's potential in using this as the format
for a local query/table cache on Splitgraph nodes that we intend to deploy
around the world for low-latency, read-only query execution.

------
ekzhu
How do you handle expensive queries? Several JOINs over multiple large data
sources can easily take minutes, if not hours.

------
CharlesDodgson
Looks lovely - I can see real use for this in my work. Postgres and the
availability of the PostGIS extension are really useful for mapping data and
spatially related queries.

------
Vaslo
Can other database systems, like SQL Server or Oracle, be used to do this?
Many of us are forced to use systems other than Postgres.

~~~
chatmasta
I'm not sure if you're asking about (a) querying Oracle from a Postgres client
through Splitgraph, or (b) querying Splitgraph from Oracle.

We want to support both these use cases. For (a), Oracle would be an
"upstream" to Splitgraph. We'll need to write a plugin that implements the FDW
and does introspection. Eventually, we want you to be able to configure
upstreams from the Web UI.

For (b), you can probably find a way to query Splitgraph from Oracle, e.g.
using Oracle's "gateway" feature [0]. What's nice about Splitgraph is that
it's compatible with any SQL client that can speak the Postgres protocol (or
ODBC). So if Oracle can connect to a Postgres database, it can connect to
Splitgraph.

We have instructions for how to query Splitgraph from within ClickHouse at
[1]. We're actually giving a presentation about this to a ClickHouse meetup on
Sep 10, feel free to join. [2]

[0]
[https://docs.oracle.com/cd/E18283_01/owb.112/e10582/gateways_connection.htm](https://docs.oracle.com/cd/E18283_01/owb.112/e10582/gateways_connection.htm)

[1] [https://www.splitgraph.com/connect](https://www.splitgraph.com/connect)

[2] [https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Meetup/events/272508526/](https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Meetup/events/272508526/)

~~~
hodgesrm
SF ClickHouse meetup organizer here. Thanks for the shout-out. We're looking
forward to hearing about Splitgraph on September 10th.

------
molszanski
Great project! Congrats on the launch.

