
Show HN: Splitgraph DDN – Public PostgreSQL proxy to 40k+ datasets - mildbyte
https://www.splitgraph.com#
======
mritchie712
I'm getting:

    ERROR: Internal Splitgraph error
    CONTEXT: PL/pgSQL function schema_controller.fatal(text) line 3 at RAISE

When running:

    SELECT ":id", "vacancy_rates"
    FROM "brla-gov/census-demographics-xsrb-mxqt:latest"."census_demographics"
    LIMIT 100;

~~~
mildbyte
Thanks for the report!

Our error reporting could definitely be more informative here. I've just
looked at the logs and in this case the problem is that the upstream
government data portal ([https://data.brla.gov/](https://data.brla.gov/)) is
temporarily unavailable.

~~~
mritchie712
I see, so you hit the data in real time? Do you cache anything?

~~~
mildbyte
We hit it in real time for most datasets. A lot of government open data
portals are powered by Socrata [0] and we wrote a foreign data wrapper that
translates the query into their proprietary query language. If it's a dataset
that we host ourselves, there's no problem with the upstream being
unavailable.
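
As a rough illustration of that translation step (not the actual foreign data wrapper), a simple single-table SELECT could map onto Socrata's `$select`/`$where`/`$limit` request parameters along these lines. The function name `to_soql_url` is hypothetical, and the domain and dataset ID are assumptions read off the portal link and image name earlier in this thread.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def to_soql_url(domain, dataset_id, columns, limit=None, where=None):
    """Build a Socrata request URL for a simple single-table SELECT.

    Hypothetical helper: it only handles a column list, an optional
    filter expression and an optional row limit.
    """
    params = {"$select": ", ".join(columns)}
    if where is not None:
        params["$where"] = where
    if limit is not None:
        params["$limit"] = str(limit)
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"


# Assumed mapping of the query from the report above:
#   SELECT ":id", "vacancy_rates" FROM ... LIMIT 100
url = to_soql_url("data.brla.gov", "xsrb-mxqt", [":id", "vacancy_rates"], limit=100)
print(url)
# Decode the query string again to show the parameters that were sent.
print(parse_qs(urlsplit(url).query))
```

A real wrapper also has to decide which parts of a query (filters, ordering, aggregates) can be pushed down to the portal and which must be evaluated locally after fetching rows.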

We have a cache too, but it currently just keys results on a hash of the query's AST. In the
future, we'll be looking at caching actual tables as well -- basically, we'd
store some regions of the remote table and selectively pull data from upstream
when a query comes in to fill out our view of what the remote table looks
like.
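
To make the idea of a query-level cache concrete, here is a minimal sketch. It does not parse SQL into a real AST; as a stand-in, it hashes a normalized token stream, so that queries differing only in whitespace, keyword case or a trailing semicolon share one key. The names `query_key` and `QueryCache` are hypothetical, and there is no expiry or invalidation.

```python
import hashlib
import re

# Quoted identifiers and string literals are matched as single tokens and
# kept verbatim; everything else (keywords, numbers, punctuation) is split
# into separate tokens.
_TOKEN = re.compile(
    r'"(?:[^"]|"")*"'          # double-quoted identifier
    r"|'(?:[^']|'')*'"         # single-quoted string literal
    r"|[A-Za-z_][A-Za-z_0-9]*"  # keyword or bare identifier
    r"|\d+(?:\.\d+)?"          # number
    r"|\S"                     # any other single character
)


def query_key(sql):
    """Return a hash that is stable across formatting differences."""
    tokens = []
    for tok in _TOKEN.findall(sql):
        if tok[0] not in "\"'":
            tok = tok.lower()  # fold case outside quoted tokens only
        tokens.append(tok)
    if tokens and tokens[-1] == ";":
        tokens.pop()  # ignore a trailing statement terminator
    return hashlib.sha256("\x00".join(tokens).encode("utf-8")).hexdigest()


class QueryCache:
    """Serve repeated queries from memory instead of going upstream."""

    def __init__(self, run_upstream):
        self._run_upstream = run_upstream  # callable: sql -> rows
        self._results = {}
        self.misses = 0

    def execute(self, sql):
        key = query_key(sql)
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._run_upstream(sql)
        return self._results[key]


cache = QueryCache(lambda sql: [("example row",)])
cache.execute('SELECT ":id", "vacancy_rates" FROM "a"."b" LIMIT 100;')
cache.execute('select ":id",\n  "vacancy_rates"\nfrom "a"."b" limit 100')
print(cache.misses)
```

A cache like this only helps when the same query is repeated. Two different queries over the same rows still go upstream twice, which is the gap that caching regions of the table itself would close.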

[0] [https://www.splitgraph.com/docs/ingesting-data/socrata](https://www.splitgraph.com/docs/ingesting-data/socrata)

------
rco8786
Whoa this seems incredibly cool and useful.

------
chatmasta
Hi HN,

You may have seen our previous Show HN [1] for Splitgraph, a tool to build,
query and share PostgreSQL database snapshots. Today we are launching our
first iteration of the service we’re building around that core code.

We’re calling Splitgraph a “Data Delivery Network” (DDN). It’s an integrated
data catalog and distributed SQL caching proxy, built on the PostgreSQL wire
protocol, with value-adds like access control and query rewriting. It can
forward SQL queries from existing SQL clients to live upstream data sources,
or to versioned data snapshots known as "data images" (including JOINs across
sources).

For example, the public instance at Splitgraph.com indexes 40k+ public
datasets and lets you query them directly from your existing SQL client or BI
tool, without needing to install anything else. You can also consume
Splitgraph data through a REST API powered by PostgREST, and you can build
data images with `sgr` and push them to Splitgraph (e.g. it’s really easy to
import a CSV, turn it into an image, and push to Splitgraph).

The public endpoint is at `postgresql://data.splitgraph.com:5432/ddn` and we
provide instructions on getting credentials and connecting with a lot of
popular clients like DBeaver/psql/pgcli/Google Data Studio, as well as some
sample queries (you can jump right in at www.splitgraph.com/connect).
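
For anyone scripting against the endpoint rather than using a GUI client, a connection might look roughly like the sketch below. It assumes the third-party `psycopg2` driver is installed and that the credentials obtained from the connect page are used as the username and password; `ddn_dsn` and `fetch_sample` are hypothetical helper names, and the sample query is the one quoted elsewhere in this thread.

```python
from urllib.parse import quote

DDN_HOST = "data.splitgraph.com"
DDN_PORT = 5432
DDN_DATABASE = "ddn"


def ddn_dsn(username, password):
    """Return a PostgreSQL connection URI for the public endpoint."""
    # Percent-encode the credentials so characters such as "@" or "/"
    # cannot be mistaken for URI delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{DDN_HOST}:{DDN_PORT}/{DDN_DATABASE}"


def fetch_sample(dsn):
    """Run one sample query and return the rows (requires network access)."""
    import psycopg2  # third-party driver, imported only when actually used

    query = (
        'SELECT ":id", "vacancy_rates" '
        'FROM "brla-gov/census-demographics-xsrb-mxqt:latest"."census_demographics" '
        "LIMIT 100"
    )
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()
    finally:
        conn.close()


# Placeholder credentials; substitute the ones issued for your account.
print(ddn_dsn("YOUR_USERNAME", "YOUR_PASSWORD"))
```

Because the endpoint speaks the PostgreSQL wire protocol, any other PostgreSQL driver should work the same way with the same URI.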

In the near future, we're planning to add more pluggable upstream data sources
to public/self-hosted Splitgraph, letting it proxy to data warehouses
(Snowflake, BigQuery, Redshift, etc.) and third-party SaaS APIs (Salesforce,
Google Analytics, etc.). As a proxy, we’re well positioned to add services on
top like caching, granular access control, firewalling, query rewriting and
scheduled queries. As a catalog, we have a natural UI to implement management
of upstreams (e.g. sharing access to a table or view is as easy as clicking
“share”).

We think there is demand for a data catalog, and we think combining it with a
SQL access endpoint opens a lot of opportunities. In particular, we expect
access control over disparate data sources to be a leading use case. Companies
are rarely able to keep all their data in just one warehouse, for one reason
or another, so adding an aggregation layer on top can make sense in some
scenarios.

We have a blog post [2] where we detail our vision for it.

We hope you try it out, and we would love to hear any feedback!

If you think you could use Splitgraph at your company, we’re in the midst of
developing a Private Cloud product and would love to talk. Reach out at
miles@splitgraph.com (chatmasta on HN) and artjoms@splitgraph.com (mildbyte on
HN).

[1] [https://news.ycombinator.com/item?id=23627066](https://news.ycombinator.com/item?id=23627066)

[2] [https://www.splitgraph.com/blog/data-delivery-network-launch](https://www.splitgraph.com/blog/data-delivery-network-launch)

~~~
hodgesrm
Anyone interested in hearing more please join the next San Francisco
ClickHouse meetup. The Splitgraph folks will be doing a presentation on
integration of open data with ClickHouse.

[https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Mee...](https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Meetup/events/272508526/)

~~~
chatmasta
We're excited for it :)

In the meantime, if anyone wants to query Splitgraph data from ClickHouse, we
have specific instructions for that here:

[https://www.splitgraph.com/connect](https://www.splitgraph.com/connect)

------
mritchie712
How much does this cost?

~~~
mildbyte
Do you mean to run or to buy?

To run: this is very lean (the whole stack, including our public website, our
REST API etc., is currently running on a ~60 EUR/month Scaleway instance). This is
because for most datasets we proxy queries to upstream government data portals
(there are a few datasets that we host ourselves). So the only costs are compute
and storage for a cache of frequent queries.

To buy: we're currently developing a self-hostable version of this,
except it will be an internal proxy that forwards queries to your data
warehouse, third-party SaaS etc, with extra services on top like access
control/caching/scheduled queries. We aren't selling it yet, but we do want to
work closely with a few potential clients to prioritize feature development.
You can read more about our plans at [0] if you're interested!

[0] [https://www.splitgraph.com/about/company/private-cloud-beta](https://www.splitgraph.com/about/company/private-cloud-beta)

