
Foreign data wrappers: PostgreSQL's secret weapon? - chatmasta
https://www.splitgraph.com/blog/foreign-data-wrappers
======
claytonjy
Off-topic, but are FDWs a reasonable way to migrate data between databases?

We have most of our data in a Postgres DB with a layout we like: multiple
"core" or low-level schemas and a higher-level API schema which is how all
other services interact with it (reads and writes). We also have a few much
smaller but very important additional SQL DBs (Postgres and MySQL). There's no
good reason all this data can't or shouldn't live together (it's really all
one "service"), and plenty of ways it would make our architecture and dev work
simpler (we join across them all the time).

So, I figure for each smaller DB, I could expose it with an FDW from the DB we
like, and use those tables in that DB's API schema. Once we settle on good API
& usage patterns there, I would then copy data from the FDW tables into local
tables, update references in the API schema, and finally drop the FDW and
decommission the other DBs.

Is that sane? I know the usual caveats about order-of-operations, potentially
writing to two places for some time to ensure no data is lost, etc. and am not
worried about those.
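
For concreteness, that plan might look roughly like this, assuming postgres_fdw and a hypothetical smaller database called "billing" (all names here are illustrative; a MySQL source would use mysql_fdw instead):

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER billing_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'billing.internal', dbname 'billing');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER billing_srv
    OPTIONS (user 'app', password 'secret');

-- Pull the remote tables into a local schema...
CREATE SCHEMA billing_fdw;
IMPORT FOREIGN SCHEMA public FROM SERVER billing_srv INTO billing_fdw;

-- ...and expose them through the API schema while patterns settle.
CREATE VIEW api.invoices AS SELECT * FROM billing_fdw.invoices;

-- Cut-over: materialize locally, repoint the API, drop the wrapper.
CREATE TABLE core.invoices AS SELECT * FROM billing_fdw.invoices;
CREATE OR REPLACE VIEW api.invoices AS SELECT * FROM core.invoices;
DROP SCHEMA billing_fdw CASCADE;
DROP SERVER billing_srv CASCADE;
```

The last block is the step you describe: copy the foreign data into local tables, update the references, then decommission the source.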

~~~
SOLAR_FIELDS
Back in my DBA days, I used the MSSQL equivalent, "Linked Servers", to set up
regularly scheduled ETLs. It's a nice way to reason about tables since you can
test your joins in real time. The only real caveat here is performance: as
someone else mentioned in another comment, joins can be quite slow. In my
case my ETLs were usually once daily and the fact that they took 1 minute
instead of 1 second was not a huge deal.

~~~
michaelwda
I've also used this, combined with an agent job to keep data in sync between a
primary system of record and a replica.

In my case I was moving account number information between iSeries db2 and an
ArcGIS instance.

------
rattray
Perhaps the title should change to "Using PostgreSQL's foreign data wrappers
with Splitgraph" since this seems to be a tutorial to that effect.

~~~
te_chris
Every technical blog post is these days, all just part of one massive content
strategy grift. It's tiring. I miss the old days of good technical writers
writing useful blog posts which people read and appreciated.

~~~
pjungwir
For Postgres specifically, I've been enjoying Bruce Momjian's blog lately [1].
He posts some tidbit every day or two, and not trivial things you'd already
know. depesz.com is great too and really helps you stay on top of what's new
in Postgres. And the pg consulting companies post quite a lot of great
articles that aren't really about selling you something. (Obviously they would
be happy for you to hire them, but that's pretty indirect.)

[1] https://momjian.us/main/blogs/pgblog/2020.html

~~~
valuearb
Thanks for sharing.

------
dezzeus
I'm using Postgres FDW at my current work and, while it has its advantages and
use cases, JOIN operations can be terribly slow. Also, good luck (not) working
with remote sequences.

~~~
Shoreleave
I'm struggling with this at the moment. Any tips to make JOINs faster?

And a question for any postgres people: say I have two distinct databases, but
now need to join across them. What's best practice here?

~~~
mildbyte
There are some query planner tweaks you can use to speed up JOINs with FDWs [0].
In layered querying [1], we had an issue with the planner choosing nested-loop
joins (which essentially run as many small single-row fetches); these tank
performance when starting a scan has a large latency overhead, which can happen
if the FDW underreports its startup cost.

If you use `SET enable_nestloop=off`, this will disable them for that session
and use alternative strategies (like hash or merge join) which might be
faster.

[0] https://www.postgresql.org/docs/current/runtime-config-query.html

[1] https://www.splitgraph.com/docs/large-datasets/layered-querying
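
A quick way to check whether this is your problem (table names hypothetical):

```sql
-- See whether the planner chose a nested loop over the foreign table:
EXPLAIN
SELECT *
FROM local_orders o
JOIN remote_customers c ON c.id = o.customer_id;  -- remote_customers is foreign

-- Disable nested loops for this session only and compare plans:
SET enable_nestloop = off;
EXPLAIN
SELECT *
FROM local_orders o
JOIN remote_customers c ON c.id = o.customer_id;

RESET enable_nestloop;  -- restore the default when done
```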

------
omani
you just took a feature of postgresql and mentioned it in your article to sell
your product.

you didn't answer your own question. is it a secret weapon?

~~~
chatmasta
Splitgraph co-founder here. Splitgraph is based on PostgreSQL, so naturally
all of its features are part of Splitgraph as well. We want to add value to a
battle-tested RDBMS rather than inventing a database from the ground up.

In particular, we really like the feature of foreign data wrappers, and it
forms the basis of one of our core abstractions ("mounting" upstream data)
[0]. We've added a lot of scaffolding to make writing and using FDWs easier
with a Splitgraph engine (which is just Postgres with the Splitgraph library
loaded into it). You can also IMPORT from upstream sources using Splitfiles,
to pack the data into a versioned “image” (analogous to Dockerfiles and docker
images).

As for whether they're a secret weapon? Well, they're not so secret, but it
does seem like they're underutilized considering the power they grant. We
certainly like them.

[0] I wrote a post today about how we use a Socrata FDW to "mount" 40k+
government datasets, making them all available via SQL:
https://www.splitgraph.com/blog/40k-sql-datasets

~~~
sradman
> As for whether they're a secret weapon? Well, they're not so secret, but it
> does seem like they're underutilized considering the power they grant. We
> certainly like them.

They are indeed powerful and underutilized. They also go by different names in
other RDBMSes (proxy tables, linked servers, remote tables, virtual tables,
etc.). If I remember correctly, SQL Anywhere 6 released in 1998 was the first
rich implementation [1] of "proxy tables" to "remote servers" and Microsoft
soon followed with "Linked Servers". The embedded database libraries (Apache
Derby and SQLite) added rich APIs/SPIs to create virtual table plugins like
SQLite's FTS.

Using remote tables for federated queries across heterogeneous data sources,
and for access to non-relational data sources, is a well established technique
and a powerful "weapon" for those who know how to wield it. It might not be
PostgreSQL's secret weapon but it may very well be Splitgraph's.

[1] http://dcx.sap.com/index.html#1001/en/dbwnen10/wn-remote-new601.html

~~~
ako
I think Oracle had database links before 1998, which enabled you to have proxy
tables for tables in other Oracle databases.

~~~
sradman
It looks like Oracle had CREATE DATABASE LINK since at least 7.3 [1] in 1996
(probably earlier too but I couldn't find when it was introduced). It wasn't
Oracle-only either:

> However, to access non-Oracle systems you must use Oracle Heterogeneous
> Services.

[1] https://docs.oracle.com/cd/A57673_01/DOC/server/doc/SQL73/ch4a.htm#toc087

------
mindhash
This is something I used a lot with Oracle. They call it a remote DB link. I
often wonder what the impact would be if this were vendor-agnostic; especially
for ETL, it would change the world.

~~~
aidos
Postgres does dblink too (which is Postgres to Postgres I believe). FDW is
more general in that it allows using, say, mongo collections as tables.
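
A rough comparison (connection details and table names illustrative): dblink fires off a one-shot query string and the caller declares the result shape inline, whereas an FDW gives you a persistent foreign table the planner knows about.

```sql
-- dblink: ad-hoc query against another Postgres database
CREATE EXTENSION IF NOT EXISTS dblink;

SELECT *
FROM dblink('host=db2.internal dbname=other',
            'SELECT id, name FROM users')
    AS t(id int, name text);  -- column list must be declared by the caller
```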

------
polyrand
This reminds me of SQLite's Virtual Tables. You can even return data from an
HTTP API and make it look like data from a table.

------
thinkpad20
On the subject of foreign data wrappers, there's an extension for storing
columnar data in Postgres which is implemented via this concept:
https://github.com/citusdata/cstore_fdw

I haven't used it myself but it's pretty cool that it's out there.

~~~
mildbyte
Splitgraph co-founder here. cstore_fdw is great (we even use it in Splitgraph
to store data [0])! It's not as fast as purpose-built columnar stores like
MonetDB. However, it plugs seamlessly into PostgreSQL and supports all types,
even those added via extensions. Read performance for OLAP workloads like
aggregations is better than PostgreSQL [1] and it has a much smaller IO load
and disk footprint (you can get long runs of similar values in column-oriented
storage, which compress better). As a nice bonus, you can simply swap
cstore_fdw files in and out of the database without having to "load" them into
PostgreSQL. We use this idea too to enable data sharing.

[0] https://www.splitgraph.com/docs/concepts/objects

[1] https://tech.marksblogg.com/billion-nyc-taxi-rides-postgresql.html
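
For reference, a minimal cstore_fdw setup looks roughly like this (columns illustrative, based on the project's README):

```sql
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE taxi_trips (
    pickup_at timestamptz,
    fare      numeric,
    distance  numeric
) SERVER cstore_server
  OPTIONS (compression 'pglz');

-- Rows are stored column-by-column on disk, so an analytical scan like
--   SELECT avg(fare) FROM taxi_trips;
-- only reads (and decompresses) the columns it touches.
```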

------
xwdv
If you like this you can find a lot of secret weapons in PostgreSQL just by
reading the documentation.

------
fullstop
I've only used them once, to be able to query some of pgbouncer's "tables"
from postgresql. It works like a charm, but I'm not at all concerned about
performance for queries such as these.

------
etaioinshrdlu
Could one run a separate Postgres instance for each table in a large schema,
and use FDW to stitch it all back together?

To what extent would this be noticeable to applications?

Could this be an interesting scaling/reliability strategy in some
circumstances?

~~~
mildbyte
Splitgraph co-founder (and post author) here. Most PostgreSQL clients don't
treat foreign tables any differently than real ones, but note that things like
FK constraints or triggers will have to be resolved on the remote server (your
app essentially talks to an adapter that rewrites queries and forwards them to
the remote database). This might not work that well as a scaling/reliability
strategy (since you're still sending queries to one central database) though.

But this is actually one of the cool use cases we built Splitgraph for: you
can mount a bunch of remote databases (doesn't have to be PG, can be
Mongo/MySQL etc), Splitgraph images and even remote datasets like Socrata[1]
into a single workspace and run e.g. JOINs between them. We optimize for the
OLAP (read-only) use case and have a special FDW for querying remote
Splitgraph images (we call this layered querying [0]). It downloads required
regions of the table in the background, completely seamlessly to the client
application. So you can spin up a lightweight Splitgraph engine at the edge
and point a PostgreSQL client to it. This will let you satisfy read-only
queries to huge remote datasets with a small local cache.

Re: compatibility, we have tested this setup with various analytics software
and PostgreSQL clients like DBeaver/Metabase/dbt[2] and it works pretty well.

[0] https://www.splitgraph.com/docs/large-datasets/layered-querying

[1] https://www.splitgraph.com/blog/40k-sql-datasets

[2] https://www.splitgraph.com/product/splitgraph/integrations

------
mooneater
Love postgres. But for FDWs, don't you have to spell out the schema? That makes
it hard to quickly access arbitrary data. It would be nice if it could deduce
the schema.

~~~
CuriouslyC
https://www.postgresql.org/docs/current/sql-importforeignschema.html
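
That command does exactly this: once a foreign server is defined, it introspects the remote schema and creates the matching foreign table definitions for you (server and table names illustrative):

```sql
-- Mirror two remote tables without writing out their columns by hand:
CREATE SCHEMA remote_mirror;
IMPORT FOREIGN SCHEMA public
    LIMIT TO (users, orders)
    FROM SERVER remote_srv
    INTO remote_mirror;
```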

------
gfiorav
Beware of the performance hit if you do this over the internet. We used this
with PostGIS and data was copied back and forth through the network pretty
inefficiently.

------
mushufasa
what are the downsides to foreign data wrappers?

~~~
mildbyte
Splitgraph co-founder (and post author!) here. By far the biggest problem with
foreign data wrappers is that you're still forced into PostgreSQL's format of
treating and returning each tuple separately. There's some research being done
in PostgreSQL [0] with pluggable storage formats as an alternative to using
FDWs for querying. Also, Citus had a prototype that would hook into the query
planner to vectorize cstore_fdw aggregations [1]: this is also promising for
getting around some FDW restrictions.

FDWs aren't really that layperson-friendly: to set one up, you need to run a
lot of SQL boilerplate like CREATE SERVER, CREATE USER MAPPING and
CREATE FOREIGN TABLE. We wanted to make them more accessible through our sgr
mount [2] command and snapshottable via Splitfiles (e.g. [3]).
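
For anyone who hasn't seen it, the boilerplate in question looks roughly like this (server, credentials, and columns all illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER metrics_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'metrics.internal', dbname 'metrics');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER metrics_srv
    OPTIONS (user 'reader', password 'secret');

-- Every column and type spelled out by hand, unless the wrapper
-- supports IMPORT FOREIGN SCHEMA:
CREATE FOREIGN TABLE events (
    id        bigint,
    payload   jsonb,
    logged_at timestamptz
) SERVER metrics_srv
  OPTIONS (schema_name 'public', table_name 'events');
```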

The point about data types changing is also a good one. Where the foreign data
wrapper doesn't implement IMPORT FOREIGN SCHEMA [4] through introspection on
the foreign database side, you have to enumerate all columns and types in
CREATE FOREIGN TABLE. If a column on the remote table goes away, the FDW might
still query it and cause runtime errors.

[0] https://wiki.postgresql.org/wiki/Future_of_storage

[1] https://github.com/citusdata/postgres_vectorization_test

[2] https://www.splitgraph.com/docs/sgr/data-import-export/mount

[3] https://www.splitgraph.com/docs/ingesting-data/socrata#splitfile

[4] https://www.postgresql.org/docs/current/sql-importforeignschema.html

~~~
asah
how's performance?

~~~
mildbyte
You always have the latency/bandwidth overhead from moving queries/data
between instances, but FDW performance can be surprisingly fast. There's a
spectrum trading FDW complexity for performance, and you can choose the point
on it that fits your use case.

At its simplest, the FDW can just return all tuples from the remote database
without filtering them (letting the local DB run filtering). But more advanced
FDWs like postgres_fdw[0] can push down qualifiers and joins to the remote
database. postgres_fdw even runs EXPLAIN on the remote instance and parses its
output -- essentially letting the local and the foreign query planners
collaborate on execution.

[0] https://www.postgresql.org/docs/current/postgres-fdw.html#id-1.11.7.42.13
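
One way to see the pushdown in action (table and server names illustrative):

```sql
-- EXPLAIN VERBOSE prints a "Remote SQL" line showing exactly what
-- postgres_fdw shipped to the other server (vs. filtered locally):
EXPLAIN (VERBOSE)
SELECT count(*) FROM remote_events WHERE level = 'error';

-- Opting into remote planner estimates; this is the setting that makes
-- postgres_fdw run EXPLAIN on the remote instance during planning:
ALTER SERVER remote_srv OPTIONS (ADD use_remote_estimate 'true');
```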

------
volkk
Wrote a side project using an FDW to query Twitter like 7 years ago. I really
expected this part of Postgres to never be talked about again, but alas. It
was super easy to use once built, but when I looked at the source code of the
Twitter one, it didn't seem super straightforward.

------
moxylush
I found this kind of interesting, enough to want to learn more about
PostgreSQL.

