
PgSync: Sync Postgres data between databases - aeontech
https://github.com/ankane/pgsync
======
pritambaral
I think pgloader does it better:
[https://pgloader.readthedocs.io/en/latest/ref/pgsql.html](https://pgloader.readthedocs.io/en/latest/ref/pgsql.html)

~~~
durkie
what makes it better?

~~~
dfsegoat
A few things I really like about pgloader (which to be clear, uses the COPY
protocol):

- Handles MySQL -> pg, MSSQL -> pg, and pg -> pg copies

- Very fast (Lisp / "real threading")

- Syncs schemas effectively (pgsync says data only)

- Has its own DSL for batch jobs: specify tables to include/exclude, rename
them on the fly in the destination, and cast dtypes between source and
destination if needed (see the sketch below)
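
To give a flavor of that DSL, here's a minimal sketch of a MySQL -> pg job.
The clause names come from the pgloader reference linked above, but the
connection strings and table names are made up, so check the docs before
running anything:

    LOAD DATABASE
         FROM mysql://user:pass@source-host/sourcedb
         INTO postgresql://user:pass@dest-host/targetdb

    WITH include drop, create tables, create indexes, reset sequences

    CAST type datetime to timestamptz
         drop default drop not null using zero-dates-to-null

    EXCLUDING TABLE NAMES MATCHING 'audit_log'

    ALTER TABLE NAMES MATCHING 'legacy_users' RENAME TO 'users';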

~~~
pampa
I tried it once and could not make it work.

I tried migrating a smallish MySQL database (~10 GB) to Postgres, and it
always crashed with a weird runtime memory error. Reducing the number of
threads or doing it table by table didn't help.

~~~
chucky_z
I found that compiling it with Clozure CL instead of SBCL leads to much
better memory allocation/usage.

It's worth getting it going; it's one of the greatest tools for migrating
things into and out of pgsql I've ever used. If you have pgsql in your
pipeline, get it working.

I used this and Debezium to make my ETL pipeline absolutely bulletproof.

~~~
dfsegoat
> _I used this and debezium_

This looks very cool, was not aware of it:

"Debezium is a set of distributed services that capture row-level changes in
your databases so that your applications can see and respond to those
changes."

[https://debezium.io/documentation/faq/#what_is_debezium](https://debezium.io/documentation/faq/#what_is_debezium)

~~~
chucky_z
It has a lot of required components, but I ran most of them in a single-node
setup (e.g. 1 ZooKeeper server and 1 Kafka server, using only their provided
containers) and got extremely far (think 1 billion rows a day ingested, then
transferred to pgsql and S3), with overhead still to spare.
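
For reference, a single-node stack like that can be sketched with Debezium's
published containers. The image names and env vars below are from memory of
the Debezium tutorial, so treat them as assumptions:

    # docker-compose.yml (sketch; tags and env vars are assumptions)
    version: "2"
    services:
      zookeeper:
        image: debezium/zookeeper
      kafka:
        image: debezium/kafka
        depends_on: [zookeeper]
        environment:
          ZOOKEEPER_CONNECT: zookeeper:2181
      connect:
        image: debezium/connect
        depends_on: [kafka]
        environment:
          BOOTSTRAP_SERVERS: kafka:9092
          GROUP_ID: "1"
          CONFIG_STORAGE_TOPIC: connect_configs
          OFFSET_STORAGE_TOPIC: connect_offsets
          STATUS_STORAGE_TOPIC: connect_statuses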

------
sudhirj
Postgres 10+ has logical replication features built in, which makes it a
better bet for near-zero-downtime replication, especially if you're trying
to move a database geographically, to a different machine, or to a different
Postgres version without the downtime that a full dump and restore requires.

[https://www.postgresql.org/docs/10/logical-replication.html](https://www.postgresql.org/docs/10/logical-replication.html)

This isn't rsync-style two-way sync, though; it's more for one-way streaming
/ ongoing replication, although you could technically set up master-master
or a replication chain if the conflicts are dealt with correctly.
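
For reference, the built-in mechanics boil down to a publication on the
source and a subscription on the destination. The names and connection
string below are made up, and the source needs wal_level = logical:

    -- on the source (publisher):
    CREATE PUBLICATION my_pub FOR TABLE users, orders;

    -- on the destination (subscriber):
    CREATE SUBSCRIPTION my_sub
      CONNECTION 'host=source-host dbname=mydb user=replicator'
      PUBLICATION my_pub;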

------
hinkley
Last time I tried to set up pg, I had a notion of a read-mostly website and I
intended to feed it with a read-only replica and squirrel away the master
someplace secure with only admin access. But it turns out that pg slaves
contact the master, and not the other way around.

I never did find a solution I could live with. I'm trying to start that
project back up right now.

------
ahachete
How does it work in the event of real-time changes happening on the
database(s)?

It appears not to use logical decoding/replication. If so, how does it sync
the data? It sounds like a hard problem, and not very efficient, to do it
without logical decoding/replication.

I didn't see documentation about it. Unless it is intended to be used only
with offline databases.

~~~
massaman_yams
It's batch sync, not streaming or replication. By default, it appears to
dedupe based on the primary key.
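
Presumably each batch boils down to a plain Postgres upsert keyed on the
primary key. A minimal sketch with a hypothetical users table (I haven't
read pgsync's source, so this is an assumption):

    INSERT INTO users (id, email, name)
    VALUES (1, 'a@example.com', 'Alice')
    ON CONFLICT (id) DO UPDATE
      SET email = EXCLUDED.email,
          name  = EXCLUDED.name;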

~~~
toomim
Is this two-way or one-way sync? It sounds like it's one-way. I don't see
anything about how to deal with clobbers, for instance when the same data is
changed in two different databases at the same time.

And "batch one-way sync" is better described as "copy".

So I think this is a "postgres database copy" tool, also known as clone, or
backup. And as such, it's competing with the existing postgres database
cloning tools, like pg_dump. So how is this different from that?

~~~
massaman_yams
One-way, yes. pg's COPY command can't do upserts; this can.

The primary use case for this appears to be ETL. pg_dump is more backup-
oriented and optimized for larger operations, whereas most ETL processes
need something more fine-grained.
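
The usual way to get upsert semantics out of COPY is to stage and merge,
which may be roughly what this tool automates. A sketch with hypothetical
table and column names:

    -- load the batch into a staging table, then merge on the PK
    CREATE TEMP TABLE users_staging (LIKE users INCLUDING DEFAULTS);
    COPY users_staging FROM STDIN;
    INSERT INTO users
      SELECT * FROM users_staging
    ON CONFLICT (id) DO UPDATE
      SET email = EXCLUDED.email,
          name  = EXCLUDED.name;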

------
natmaka
pg_comparator, albeit barely known, is a monster. For sync purposes, think
of it as rsync for PG.

It needs temp tables.

[https://www.cri.ensmp.fr/people/coelho/pg_comparator/](https://www.cri.ensmp.fr/people/coelho/pg_comparator/)

~~~
sgt
Can confirm pg_comparator is a very solid piece of software.

------
tomnipotent
Embulk has been around for a long time, and supports considerably more options
than just Postgres (especially since pgsync looks like it's just doing
statement-based data copying).

[https://github.com/embulk/embulk](https://github.com/embulk/embulk)

------
nickreese
A little late to the party, but this could possibly help someone.

I spent a few hours trying all of the options in this thread, and I couldn't
find anything that also transferred functions and triggers, which was vital
for my project.

Here is a little bash file I whipped together to completely wipe the DB and
import production into it.

[https://gist.github.com/nickreese/8eeb01e2f73fec2bf2dbfecf5d...](https://gist.github.com/nickreese/8eeb01e2f73fec2bf2dbfecf5dc6dd73)

Just put the variables in your .env file and it will wipe your local DB and
copy from production.
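
For anyone who doesn't want to click through, a rough sketch of that kind of
wipe-and-restore script (variable names are mine, not the gist's; pg_dump
carries functions and triggers along with the schema):

    #!/usr/bin/env bash
    set -euo pipefail
    source .env   # expects PROD_URL and LOCAL_DB (hypothetical names)

    # wipe the local database, then recreate it from production
    dropdb --if-exists "$LOCAL_DB"
    createdb "$LOCAL_DB"
    pg_dump "$PROD_URL" | psql "$LOCAL_DB"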

------
mhd
I'd really like a Postgres sync tool that would allow me to do something like
"copy-on-write", i.e. do an upsert on each row unless some condition
evaluates to true, using both target and source data (e.g. target
modified_at > source modified_at). That would allow common data to be easily
shared between databases, with better performance characteristics than e.g.
a foreign table/union view. (See the sketch below.)
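
Postgres can express that condition today via a WHERE clause on the conflict
action, which can reference both the target row and the incoming (EXCLUDED)
row. A sketch with a hypothetical items table, where source_items stands in
for, say, a foreign table:

    INSERT INTO items AS t (id, payload, modified_at)
    SELECT id, payload, modified_at FROM source_items
    ON CONFLICT (id) DO UPDATE
      SET payload     = EXCLUDED.payload,
          modified_at = EXCLUDED.modified_at
      WHERE t.modified_at < EXCLUDED.modified_at;  -- skip newer target rows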

------
arcticfox
How does this compare to dumping and loading for full databases? I have a
number of cloning processes that do that, should I look into this instead?

~~~
jaxn
You can't use the local DB while restoring; this should help with that (I hope).

------
blairanderson
This makes getting production data down to your local machine super quick.

Need the user where id = 64418362 and the associated orders, order_items, etc.?

Done in like 5 seconds.
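
If I'm reading pgsync's README right, this is its groups feature. The group
and table names below follow the comment above, but double-check the README
for the exact syntax:

    # .pgsync.yml (sketch)
    groups:
      user:
        users: "where id = {1}"
        orders: "where user_id = {1}"
        order_items: "where order_id in (select id from orders where user_id = {1})"

    # then:
    #   pgsync user:64418362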

------
jaxn
I have a Citus cluster in production, plus some manual scripts to turn it
back into a single-tenant database for in-house reporting (basically pg_dump
and restore). This has to happen overnight, since we can't run reports while
it is restoring.

I am excited to try this and see if it might be a better way to pull in what's
needed.

------
loganfrederick
Is there something specific in Postgres (perhaps in the binlog) that makes
this easier to do than in MySQL? There seem to be more tools for this kind
of data loading than I've seen for MySQL.

BTW: Awesome work ankane! Cool dude too :)

~~~
rpedela
This tool uses COPY, which is highly optimized and supports stdout and
stdin. The combination makes it relatively easy to copy tables around. I
don't believe MySQL's INFILE/OUTFILE is as flexible.
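
That stdout/stdin flexibility is what makes one-liners like this work
(psql's client-side \copy; database and table names are placeholders):

    psql sourcedb -c "\copy users TO STDOUT" | psql targetdb -c "\copy users FROM STDIN"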

------
chungus
I did the ghetto version of this yesterday:

    pg_dump -t tablename sourcedatabase | psql targetdatabase
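
The same trick works across hosts, which covers most one-off pulls (host
names are placeholders):

    pg_dump -h prod-host -t tablename sourcedatabase | psql -h localhost targetdatabase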

------
forgingahead
Related note to this library: Andrew Kane (@ankane) has released some great
tools that I've found useful, notably Blazer (for our use cases) -- good to
see this is another nice piece of work from him. Thanks @ankane!

------
metreo
Purely in Ruby as well... nice.

~~~
why-el
Ruby is just for plumbing, i.e. preparing queries and whatnot; it's still
pg_dump under the hood.

