
Fast Way to Load Data into PostgreSQL Using Python - gilad
https://hakibenita.com/fast-load-data-python-postgresql
======
gigatexal
As a former DBA and closet python super fan I found this post to follow all
the things I would have done. Kudos to the op.

Loading to a temp non-logged table is the right way to go and using built-in
bulk loading options is smart, too. The type annotations looked really good
and kudos for using modern Python here (although for these examples I’m not
sure they were needed; then again, they may come from a larger codebase of
which this article is just a microcosm, so I could be wrong). Very good
article. I’ve bookmarked the blog to consume other such articles.

~~~
mbogda
What about translating data on the fly into one big COPY statement?

~~~
gigatexal
I could see that working. Just depends on your use case and what you want to
accomplish.

The approach the author takes has the advantage that when data is in the load
table it can then be used to load into a real table via some transform in
batches and be done in the background.

------
louis8799
I haven't done any benchmarking, but for around 200,000 rows of data, which I
am handling daily at work, the following line of python code takes around 10
seconds.

pandas.read_json(json_obj).to_sql()

Any transformation of the data before writing into DB is split second with
vectorized pandas operation.

Pandas is still my first choice when it comes to tasks like this.
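
For reference, a minimal sketch of that approach (assuming a SQLAlchemy
engine and a hypothetical `staging_table`; parsing is split out so any
vectorized clean-up can happen before the write):

```python
import io

import pandas as pd


def records_to_frame(json_obj: str) -> pd.DataFrame:
    # Parse the JSON payload; any vectorized transformation goes here.
    return pd.read_json(io.StringIO(json_obj))


def load(json_obj: str, engine, table: str = "staging_table") -> None:
    df = records_to_frame(json_obj)
    # method="multi" batches rows into multi-row INSERTs;
    # chunksize bounds the size of each statement (and memory use).
    df.to_sql(table, engine, if_exists="append", index=False,
              method="multi", chunksize=1000)
```

Without `method="multi"`, to_sql falls back to one INSERT per row, which is
usually the slow path being benchmarked against in the article.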

------
latch
Not sure what the python code translates to, so maybe this is covered, but in
my experience, if you can't use COPY, you should:

1 - Generate a static SQL to insert N rows (say 10 to 50): insert into blah
(id, name) values ($1, $2), ($3, $4),

2 - Loop through the data in batch size

3 - Generate a dynamic sql for M rows where M is whatever's left over

Wrap it in a transaction. You can have an SQL builder where you give it the
list of columns and how many rows you want to generate and use that for both
steps 1 and 3. You can prepare statement 1 (and keep reusing it), but I'm not
sure how much that'll buy you.
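
A sketch of such a builder (using `$n`-style placeholders as in the example
above; for psycopg2 you'd emit `%s` instead), plus the batching loop:

```python
def build_insert(table: str, columns: list[str], n_rows: int) -> str:
    # One static statement inserting n_rows rows with numbered placeholders.
    n_cols = len(columns)
    groups = []
    for r in range(n_rows):
        start = r * n_cols
        group = ", ".join(f"${start + c + 1}" for c in range(n_cols))
        groups.append(f"({group})")
    return (f"insert into {table} ({', '.join(columns)}) "
            f"values {', '.join(groups)}")


def batches(rows: list, size: int):
    # Full-size batches reuse the prepared statement from step 1;
    # the final partial batch gets its own statement (step 3).
    for i in range(0, len(rows), size):
        yield rows[i:i + size]
```

For example, `build_insert("blah", ["id", "name"], 2)` yields
`insert into blah (id, name) values ($1, $2), ($3, $4)`.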

If you need to deal with conflicts, inserting the above into a temp table (on
commit drop) and then doing an insert + update from the temp table into the
real table works well.

Then there's a handful of postgresql configs to tweak, but if you're willing
to risk a bit of data loss (but not corruption), set synchronous_commit=off.

------
kristiandupont
I am looking for ways to quickly load data into a PG database for local
testing. An initial seed run is ok, but I want test suites to run on identical
data sets and since PG doesn't (yet) support nested transactions, I
currently use my seed database as a TEMPLATE for suite-specific databases.

It works reasonably well but creating the database with a template takes
several seconds, even though my test data is fairly simple. I imagine it
should be faster than anything I can do from my programming language (nodejs
in my case), or is creating with templates slow for some reason that I don't
know of?

~~~
aargh_aargh
Not an expert, but I'm curious - what use case is not sufficiently solved by
savepoints?

~~~
kristiandupont
They might well be faster, I haven't tried. I just remember looking into them
and thinking that it was more involved than what I had now so I decided
against them.
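
For comparison, the savepoint pattern is fairly small. A driver-agnostic
sketch (the connection object just needs a `cursor()` method) that wraps each
test in a savepoint and rolls back afterwards, so every test sees the same
seed data:

```python
def run_in_savepoint(conn, test_fn):
    # Open a savepoint, run the test, then roll back to it so the
    # seed data is untouched for the next test in the suite.
    with conn.cursor() as cur:
        cur.execute("SAVEPOINT test_case")
        try:
            test_fn(cur)
        finally:
            cur.execute("ROLLBACK TO SAVEPOINT test_case")
```

This assumes the whole suite runs inside one open transaction; savepoints are
PG's version of nested transactions for exactly this situation.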

------
drej
I concur - we load a fair amount of data into Postgres using Python and we've
learned these lessons the hard way. We ended up feeding generators into
execute_batch.

We don't use execute_values, because we want to generate dictionaries (since
there are usually multiple generators along the way) and we don't use
copy_from, because the cognitive overhead of data prep is just way too high
for the benefit (for us! your mileage may vary).
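
A sketch of what that pipeline can look like (table and column names are made
up; the parsing stage is a generator, so rows stream through without being
materialized as a list):

```python
def parse_lines(lines):
    # A generator stage: dicts flow through one at a time.
    for line in lines:
        name, abv = line.rstrip("\n").split(",")
        yield {"name": name, "abv": float(abv)}


def load(conn, rows):
    # execute_batch groups many parameter sets into far fewer round trips.
    from psycopg2.extras import execute_batch  # deferred so the sketch stands alone
    with conn.cursor() as cur:
        execute_batch(
            cur,
            "insert into beers (name, abv) values (%(name)s, %(abv)s)",
            rows,  # any iterable of dicts, e.g. parse_lines(open(path))
            page_size=1000,
        )
    conn.commit()
```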

Great stuff, I've already circulated it at work.

~~~
C1sc0cat
For old-school CSV and Excel files that require any processing before
inserting, I have found Perl with the relevant modules from CPAN is the best
solution.

------
mhd
Seems like a pretty good overview, but I'm not surprised at all that in the
end it's COPY TO/FROM that won out by a good margin.

Listing all the columns in order seems a bit superfluous and error prone,
though. I'd recommend creating the list of columns to be inserted (or copied;
copy_from has a 'columns' arg) and slicing the input hash with the same list
(dict comprehension in Py).
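
A sketch of that idea (hypothetical table and columns), with the column list
defined once and reused for both the line formatting and the `copy_from` call:

```python
import io

COLUMNS = ["id", "name", "abv"]  # single source of truth for column order


def row_to_line(record: dict) -> str:
    # Slice the input dict with the same list used for COPY,
    # so column-ordering mistakes can't creep in.
    subset = {col: record[col] for col in COLUMNS}
    return "\t".join(str(subset[col]) for col in COLUMNS) + "\n"


def copy_records(conn, records) -> None:
    buf = io.StringIO("".join(row_to_line(r) for r in records))
    with conn.cursor() as cur:
        cur.copy_from(buf, "beers", columns=COLUMNS)
    conn.commit()
```

(A real version would also escape tabs/newlines and encode NULLs as `\N`,
which is where COPY's data-prep overhead comes from.)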

------
tda
Very nice article! Explores all the tricks I know of, and a few more. Never
heard of execute_values before; it's only a bit (3x in the example) slower
than using copy but with much simpler code.
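
A minimal execute_values sketch (hypothetical table; rows are projected onto
an ordered tuple stream first):

```python
def to_tuples(records, columns):
    # Project dicts onto tuples in a fixed column order.
    for rec in records:
        yield tuple(rec[c] for c in columns)


def insert_rows(conn, records):
    # execute_values folds many rows into one INSERT ... VALUES statement.
    from psycopg2.extras import execute_values  # deferred import
    with conn.cursor() as cur:
        execute_values(
            cur,
            "insert into beers (name, abv) values %s",
            to_tuples(records, ["name", "abv"]),
            page_size=1000,
        )
    conn.commit()
```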

------
da_chicken
Why do you drop and recreate an unlogged table? Why not create a temporary
table, which is also not WAL-logged and automatically gets cleaned up? It
looks like you're not closing the connection. The only advantage is that the
staging table remains in place until the next execution, but presumably so
does the data file.

Alternately, if you have an unlogged table already created, why not simply
truncate the table if it exists instead of recreating it on every execution?
Drop and truncate should take about the same amount of time, but creating a
table is going to be slightly more expensive. If you're worried about making
sure the db table schema matches the file format, I don't think I would be.
File formats change, you're already going to be manually working on it. You
can drop the table so the updated script recreates it.
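
A sketch of the truncate variant (hypothetical staging table):

```python
def reset_staging(cur) -> None:
    # Create the unlogged staging table once (no-op on later runs),
    # then TRUNCATE instead of paying for DROP + CREATE every execution.
    cur.execute("""
        create unlogged table if not exists staging_beers (
            name text,
            abv  numeric
        )
    """)
    cur.execute("truncate table staging_beers")
```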

I suppose it doesn't really matter, but I'm wondering if your choice was one
of intent or one of convenience.

------
cridenour
I followed a similar path with TEMPORARY tables and using pgcopy [1] to insert
data quickly - the copy operation was extremely quick and didn't require
passing everything as strings.

[1] [https://github.com/altaurog/pgcopy](https://github.com/altaurog/pgcopy)

------
derefr
You can go faster than this. (Mind you, the following suggestions will only
matter if the DB is still your real bottleneck. Once you’re using COPY, you
start to have to optimize the rest of your ETL pipeline as well.)

I’ve done all of the following (except the last bit about pg_bulkload) in my
own ETL stack. They work!

——

1\. COPY supports a binary format in place of CSV. It’s proprietary (it’s not
even exposed in libpq, only statically linked into the postmaster, pg_dump and
pg_restore binaries!) but also pretty easy to implement. Here’s a Python
package implementing it:
[https://pypi.org/project/pgcopy/](https://pypi.org/project/pgcopy/). As long
as you’ve got the rows in memory at some point, just generate that instead of
CSV. The rest of the architecture is the same.
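
A sketch of the pgcopy route (`CopyManager` is the API the pgcopy package
exposes; table and columns here are hypothetical), plus a small chunking
helper so huge row streams stay bounded in memory:

```python
def chunked(rows, size=10_000):
    # Yield lists of at most `size` rows from any iterable.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def binary_copy(conn, table, columns, rows) -> None:
    from pgcopy import CopyManager  # pip install pgcopy
    mgr = CopyManager(conn, table, columns)
    for batch in chunked(rows):
        mgr.copy(batch)  # rows are tuples matching `columns` order
    conn.commit()
```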

2\. Run concurrent SQL COPY commands on separate connections for each
[staging] table you’re populating. This allows PG to populate the tables in
parallel.

3\. Create (or TRUNCATE) the staging tables in the same transaction that
contains the COPY. This way, the COPY won’t need to operate on a differential
MVCC state, but rather can operate on the “raw” table. (And, if this is the
“base” data load for a given table, rather than an incremental one against an
already-populated table, then you don’t need a staging table; the target table
itself is now your staging table. Just ensure that any CREATE INDEX or CREATE
PRIMARY KEY or ADD CONSTRAINT happens _after_ you populate the table with
data. No need for a TEMPORARY/unlogged intermediate; the performance would
actually be worse if you added one.)
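
A sketch of that base-load shape (hypothetical table, and a made-up index on
a `name` column; the truncate, the COPY, and the index build all share one
transaction, with the index coming last):

```python
def base_load(conn, table, columns, data_file) -> None:
    with conn.cursor() as cur:
        # Truncate in the same transaction as the COPY, so the COPY
        # writes into a table no other snapshot can see.
        cur.execute(f"truncate table {table}")
        cur.copy_from(data_file, table, columns=columns)
        # Build indexes only after the data is in place.
        cur.execute(f"create index on {table} (name)")
    conn.commit()
```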

4\. Pre-bake your .sql, embedding the `FORMAT binary` data, _then_ stream it
once you’re done. I can’t tell you how many times the problem with an ETL
pipeline wasn’t PG’s ingestion speed, but rather the speed at which the ETL
transformer process was writing to the socket. The rest of your pipeline can
be asynchronous, but actually streaming into your PG instance takes up
valuable shared resources on the instance. Assuming that the instance being
ETLed to isn’t just a replication master, and actually receives read traffic,
you’ll want writes to that instance to complete as quickly as possible—even at
the expense of making the rest of your ETL pipeline more complicated. So pre-
bake your .sql files, and then stream them. (One advantage, once you do this:
you don’t need a runtime with an SQL client library any more. You can just
spawn instances of `psql foo.sql`. These processes don’t even have to coincide
with your transform stage any more! Push your `.sql` files to object storage,
and then run `psql` on a dedicated loader VM—or from an object-storage-
lifecycle-event-triggered Lambda function, if you like.)

Fun fact: when you `pg_dump --format=directory`, pg_restore(1) will do
techniques #1 - #4 for you automatically.

Key insight: the pg_dump(1) and pg_restore(1) pair of binaries are optimized
to heck and back. If pg_dump(1) or pg_restore(1) has some special way of doing
something, it’s probably for performance; look into doing that thing in your
pipeline. (Even if there’s no library for your runtime to allow you to do that
thing. Write one!)

——

You can go faster than even pg_dump+pg_restore will “naively” do, while still
being compatible with “standard Postgres”, if you’re willing to mangle your
data architecture a bit. Specifically, you can take advantage of Postgres’s
table partitioning support. (Yes, now you have to deal with partitions. But
hey, you probably already had to at the data sizes we’re talking about.)

5\. Turn your incremental loads into table foo (via fresh staging tables) into
base loads of fresh partitions of table foo. Each load creates its own
partition. Techniques #1 and #2 will accelerate and parallelize the loading of
sibling partitions of a parent table, just as well as they’ll accelerate the
loading of separate tables.

Once you do this, you don’t need _any_ staging tables, because every load is a
“base” load of a table. (Yes, you’ll have a lot of tables. You’ll need some
agent sitting around to do async rollups using `CREATE TABLE AS ...` to
consolidate these. If you have access to the TimescaleDB extension, you might
be able to trick it into doing this for you automatically.)

#6 (or #3b). Include your CREATE INDEX statements in the same transaction that
CREATEs and COPYs into the partitions (after the data is populated, though.)
One of the worst things for the production performance of a data warehouse is
contention between application reads and an index rebuild. If you always
create new partitions, and only ever index them when you’re creating them,
then your indices will be created against tables that aren’t visible to new
clients yet, and so clients and the indexer will never get in one-another’s
way.

——

If, after all this, you’re looking for the absolute fastest way to load data
into Postgres, _and_ you have control over your PG instance (i.e. it’s not a
hosted DBaaS), you could try
[https://pgxn.org/dist/pg_bulkload/](https://pgxn.org/dist/pg_bulkload/). It’s
a PG extension that serves as a faster replacement for the COPY command, by
bypassing most of the buffers and locks related to PG’s individual request-
handling. It shouldn’t cause any problems with MVCC, but only because it
assumes/requires that you’re also doing all of the above, so you’re never
trying to load into a table users can “see.” (I haven’t tried it personally,
but if PG’s ingest speed becomes my bottleneck again, I just might.)

------
Bishonen88
Somewhat related: how does one insert a Pandas DataFrame into a Postgres
table, in which most fields are merely integers being foreign keys of other
tables? The DataFrame itself contains the real values (e.g. Cities, Countries
etc.).

~~~
dekhn
I use write_sql, but in this case you would have to resolve the object values
to their integer primary keys.

------
2rsf
Very nice and educational read! One note though: it's usually good practice
in performance testing to repeat your measurements more than once; it will
mitigate local problems or at least point to problems in the measurement.

------
zarathustraa
Would also like this with asyncpg. psycopg2 is kind of obsolete today.

~~~
jsmeaton
Bold claim, since psycopg2 is used overwhelmingly more than any other library.
Though asyncpg (or something like it) is likely the future, it’s a long way to
go before calling psycopg obsolete.

~~~
Stormwalker
Asyncpg is not the future, it is the present. We use it in production and are
quite happy with it.

~~~
da_chicken
That still doesn't really challenge parent's claims.

------
cgio
There’s also copy_expert in psycopg2. I believe csv with copy_expert should be
the fastest but have not tested in as much detail as op. Great work on the
article.

~~~
mhd
IIRC copy_expert just lets you enter the postgres COPY command that's used to
start the transfer manually, which by itself doesn't change anything.

It does allow you to use the BINARY transfer format, though, and that's quite
often the fastest version, at the cost of a more complicated data
transformation step.
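
A sketch of copy_expert with a CSV source (hypothetical table), where the
COPY options are spelled out directly:

```python
def copy_csv(conn, table, csv_file) -> None:
    # copy_expert takes the full COPY command, so CSV options (or
    # `FORMAT binary`) can be chosen here; copy_from cannot do that.
    sql = f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)"
    with conn.cursor() as cur:
        cur.copy_expert(sql, csv_file)
    conn.commit()
```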

~~~
cgio
Plus freeze for use cases like the example in the post.

