
Boosting the performance of PostgreSQL’s COPY command by dropping indexes - palewire
https://www.californiacivicdata.org/2018/01/25/index-drop-and-copy/
======
Johnny555
Is it not common knowledge that dropping indexes improves database insert
performance?

Of course, that's often not an option when you're loading records into a live
database that's also serving queries; you usually don't want every query to
result in a full table scan.

This was well known 20+ years ago when I was an entry-level DBA, and I assumed
it was still well known today.

~~~
cuchoi
I think what the article is proposing is that it can be quicker to drop the
indexes and recreate them than to load a lot of data into an indexed table.
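
The pattern being discussed can be sketched in a few lines. This is a hypothetical illustration only; the article's context is PostgreSQL's COPY with DROP INDEX / CREATE INDEX, but SQLite is used here so the sketch is self-contained and runnable, and the table and index names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_amount ON filings (amount)")

rows = [(i, i * 1.5) for i in range(100_000)]

# 1. Drop the index so the bulk load doesn't maintain it row by row.
conn.execute("DROP INDEX idx_amount")
# 2. Bulk-load the data.
conn.executemany("INSERT INTO filings VALUES (?, ?)", rows)
# 3. Rebuild the index once, in a single pass over the loaded data.
conn.execute("CREATE INDEX idx_amount ON filings (amount)")
conn.commit()
```

The end state is identical to loading into the indexed table; only the order of work changes.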

~~~
mgkimsal
and I think that's what johnny555 is questioning - isn't this common
knowledge already? apparently not.

here's another tip, at least on mysql, but possibly other databases that have
memory tables: import stuff into memory tables, then insert from the memory
table to a disk-based table. I took a process that was naively importing data
via SQL commands, which took close to 24 hours, down to around 20 minutes by
breaking it up, chunking imports to memory tables, then copying those to
permanent disk. This was years ago (12?) and mysql is probably better about
insert handling than it was, but that approach (plus the drop/recreate
indexes) meant this was a smallish process vs a 24 hour import cycle.
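
The staging-table trick above can be sketched as follows. The commenter's setup was MySQL MEMORY tables; this hedged sketch uses SQLite's `ATTACH ':memory:'` instead, purely so the example is self-contained, and the table names are invented:

```python
import os
import sqlite3
import tempfile

# Disk-backed "permanent" database.
path = os.path.join(tempfile.mkdtemp(), "perm.db")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE records (id INTEGER, payload TEXT)")

# Attach an in-memory database to play the role of the memory table.
disk.execute("ATTACH DATABASE ':memory:' AS mem")
disk.execute("CREATE TABLE mem.staging (id INTEGER, payload TEXT)")

# Chunked import lands in the memory table first...
chunk = [(i, f"row-{i}") for i in range(10_000)]
disk.executemany("INSERT INTO mem.staging VALUES (?, ?)", chunk)

# ...then one set-based statement copies it to permanent storage.
disk.execute("INSERT INTO records SELECT * FROM mem.staging")
disk.execute("DELETE FROM mem.staging")  # ready for the next chunk
disk.commit()
```

The win comes from the disk table seeing one large set-based insert per chunk instead of many small row-by-row writes.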

~~~
mbrumlow
> apparently not.

Just because one blog post makes it seem that way, does that make it not known?

~~~
mgkimsal
not just because of that - cuchoi also seemed to imply that.

relatedly/anecdotally, I still run into people who aren't aware of this, as
they don't understand what happens when you're inserting data into a database
or what indexes are in the first place.

------
drblast
I must be getting old.

Kids, many years ago, even before jQuery, software would come with
documentation that you could read and it would tell you how to use it
effectively.

I know, crazy right? But to this day some of that old software, of which
PostgreSQL is an example, still has this documentation that you can read, even
before you use the software in a production system.

Yeah, yeah, I know Agile and Docker solved the problem of ever having to
document anything, but this is the way things used to be and a few of us are
stuck in our ways and still like it.

~~~
jeremiep
And then people wonder why Agile never produces quality results.

Intellisense has replaced the need to read the docs and Agile has replaced the
need to understand what you're doing.

It's no surprise that basic knowledge found in the documentation is later
"discovered" when the project is already running in production.

~~~
cookiecaper
To be fair, the honest among us will admit that it's not unusual to miss
something glaringly obvious for an embarrassingly long amount of time.

The difference is really in whether you recognize the issue and quietly hope
no one finds out how dumb you really are, or whether you make a big
celebratory blog post about the secret behind your "pioneering" work, making
sure that your title and first and last name are clearly attached. And of
course, we can't fail to highlight the further brilliance of accomplishing
this marvelous feat by employing "rarely used, low-level" commands from within
the framework's ORM.

Hold on to your butts, because next week he's going to learn that you can
execute commands directly on the server, without even having to use the "low-
level" elements of an ORM! I can't wait for the field to be revolutionized by
Lead Developer James Gordon's next discovery.

~~~
nsxwolf
Maybe cut Lead Developer James Gordon some slack. He may not have wanted this
attention, which came from Ben Welsh. Welsh describes himself as a "data
journalist" and a "hack computer programmer". Perhaps he just thought this
improvement made by the developer was really cool, and wanted to blog about
it, not understanding the snark's nest he was stepping into.

~~~
cookiecaper
You're right, of course. I considered mentioning the possibility that this was
meant more as a test PR style or tongue-in-cheek "Haha this is obviously super
important"-style post, since these are definitely viable explanations, but I
felt the rant was already long enough and I couldn't find an easy way to work
it in.

One of the most dangerous things about sharing stuff with others, especially
isolated items from unknown authors with a worldwide audience, is that you
never really know how much of their own context the recipient will read in, or
how much of the assumed / pre-requisite context they'll fail to infer (or
infer differently than intended).

You only get better at this through repeated practice, but you can't ever be
perfect at it. Especially in a world of complex social interactions where
people don't always mean what they say or say what they mean, and the lack of
body language and facial expression in written language silently corrupts the
signal.

All readers should always remember, it is easy to criticize. It is much harder
to do. Critics especially need to remember this, because it's very easy,
automatic in fact, to fall into a pattern of judgment and criticism when we're
regularly exposed to so much stuff from so many sources. But it's good to get
out there and try, because it's often much harder than it looks, and
especially if you've been on the sidelines judging for a long time, it can be
jarring how much harder it is to _do_ than to _say_ (or, particularly,
deride).

A great way to test this: if there's some radio program you listen to
regularly where callers can share a brief anecdote or story, call in. As a
regular listener, you've been silently evaluating callers on a daily basis for
years. You likely have some opinion about the practices of good callers and
bad callers. Call in and you'll be surprised how nerve-wracking it can be,
even for someone as "experienced" as yourself, and I think you'll be
disappointed in your overall performance (unless you've _thoroughly_ rehearsed
ahead of time).

This is just a tiny, irrelevant thing that most people wouldn't give any
thought to, a quick ~60 second phone call. You certainly don't give much
thought to it every day when you dismiss anecdotes or stories told by
amateurs. But try it yourself and you'll gain some perspective on the
difficulty gap between doing vs. criticizing.

The peanut gallery can, and will, always find something to nitpick. Don't take
it personally. Use that data to hone your interactions and get a better-tuned
result next time.

------
cookiecaper
Wow. You've gotta love the audacity in making a big announcement like this
based on the developer finally getting around to _reading the docs_. "Drop
constraints and indexes for faster imports" is mass import 101.

The entirety of the "Why We Did It" section:

-----

> This improvement was pioneered by James Gordon, the Coalition’s lead
> developer.

> He drew instruction from PostgreSQL’s official documentation, which reads:

>> [snipping quoted sections from PostgreSQL manual at
[https://www.postgresql.org/docs/10/static/populate.html#POPU...](https://www.postgresql.org/docs/10/static/populate.html#POPULATE-RM-INDEXES) ]

> Gordon’s code handles this task using rarely utilized, low-level tools in
> Django’s database manager.

-----

Sadly, in the current day and age, a developer actually taking the time to
RTFM may indeed qualify as "pioneering" work!

Perhaps the rest of us need to start trumpeting our accomplishments when we
find some clearly-stated performance gain in the manual, rather than hiding
our heads in embarrassment for not finding out until we released version 2.2
of our mass DB import tool.

~~~
larzang
Heads are going to explode when they discover the arcane rarely-utilized tool
known, nay, merely whispered of, only as "transactions".

~~~
mulmen
Just today someone asked me "what's MVCC[1]?". One of today's lucky 10,000
[2].

[1]:
[https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
[2]: [https://xkcd.com/1053/](https://xkcd.com/1053/)

------
rosser
Do you have any numbers on how much _extra_ time (that is, time spent
servicing queries above the normal query times when your tables are indexed)
application queries take after the loads, but before the index rebuilds are
complete?

If so, how does that compare, in aggregate, to the time saved in the loads?

Or are you simply not putting the application back into service until the
index rebuilds have finished? How long does that take, compared to the time
saved?

EDIT: I'm mostly asking these questions to nudge people to think about them in
the course of trying this in their own environments. It's my day job to think
about these kinds of things; I've worn the PostgreSQL DBA hat for over a
decade now.

~~~
whistlerbrk
Yes, you would never do this unless writes to your app were completely
disallowed. You will save time building the indices in one shot.

~~~
rosser
My point is: this is not _necessarily_ true. The Fine Article even mentions
that the benefits are situational.

------
jacobkg
Do you also disable autovacuum while this is running? That is another good
trick for speeding up large database imports.

~~~
protomyth
I would, since it would be a waste on a table loaded right after a create.

------
p2t2p
N̶o̶ ̶s̶h̶i̶t̶,̶ ̶S̶h̶e̶r̶l̶o̶c̶k̶!̶ You don't say?

~~~
cuchoi
The article is proposing that it can be quicker to drop the indexes and
recreate them than to load a lot of data into an indexed table.

~~~
segmondy
It's not a proposal, it's common knowledge. INSERTs cause the index to be
updated: you have to search for the right location, then insert the new
pointer. If you do 1,000,000 inserts, that's 1,000,000 searches and writes to
the index.

~~~
ninkendo
Sure INSERTs do, but I personally would have guessed COPY does its work in a
single transaction, which behind the scenes I would hope would avoid updating
the index until the transaction is committed?

I admit there’s no reason to expect this other than “so you don’t have to do
the dumb drop/recreate-indexes trick”... maybe I just expected COPY to be
smarter than just expanding into a set of insert statements.

------
karlmdavis
From personal experience: PostgreSQL's COPY commands aren't really all that
performant, indexes or no.

Our project saw SIGNIFICANTLY better performance with batched multi-threaded
INSERTs. If you can run a few hundred load threads and manage the concurrency
correctly (not trivial), it will chew through big loads like a monster.

If I ever have the time/excuse, I want to go back and try a multi-threaded
COPY. But if you need speed and have a choice between multi-threaded INSERTs
or a single-threaded COPY, go with the INSERTs every time.
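
The batching half of this approach can be sketched as follows: accumulate rows into fixed-size batches and insert each batch in one statement, rather than one INSERT per row. The multi-hundred-thread part assumes a server like PostgreSQL; this sketch uses SQLite single-threaded, with invented table names, just to keep it runnable:

```python
import sqlite3

BATCH = 1000
rows = ((i, i % 7) for i in range(50_000))  # pretend this streams from a file

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, kind INTEGER)")

batch = []
for row in rows:
    batch.append(row)
    if len(batch) == BATCH:
        # One round trip per 1000 rows instead of per row.
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        batch.clear()
if batch:  # flush the trailing partial batch
    conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
conn.commit()
```

In the multi-threaded variant the commenter describes, each worker would run this loop over its own slice of the input and its own connection.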

~~~
davidgould
COPY, of course, is single threaded. If you can split your input data and run
multiple copy operations you can get similar increases, although I would never
suggest hundreds of threads. Depending on your I/O system, something like
twice as many threads as CPU threads is probably going to work better.

On a one for one basis COPY IN will be faster than inserts:

- COPY uses a special optimization in the access method: instead of running
the full insert logic (find a target page, lock it, insert, unlock page) per
row, it batches all the rows that will fit on the target page.

- COPY overall has shorter code paths than regular inserts.

~~~
imtringued
Why not just use regular batched inserts with unnest to turn several array
parameters into a table, instead of using some arcane, hard-to-use SQL command?
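
For readers unfamiliar with the unnest pattern, here is a sketch of the SQL a client would send to PostgreSQL: one array parameter per column, with `unnest()` turning the arrays into rows. The table and column names are hypothetical, and the query is only built as a string here since no Postgres server is assumed:

```python
# Column-wise data: one Python list per column.
ids = [1, 2, 3]
amounts = [10.0, 20.0, 30.0]

# unnest(a, b) zips the arrays into a (id, amount) row set,
# so one parameterized statement inserts the whole batch.
query = (
    "INSERT INTO filings (id, amount) "
    "SELECT * FROM unnest(%s::int[], %s::float8[])"
)
params = (ids, amounts)
```

With a driver such as psycopg2 this would be executed as `cur.execute(query, params)`, sending the entire batch in a single statement.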

------
rtpg
Why is everyone acting so pissy about the fact that someone happened to find a
performance improvement trick by reading the docs? Isn't this what's supposed
to happen?

None of you read all the performance "tricks" for Postgres before writing your
first SQL statement.

Every day, somebody's born who doesn't know how to boost the performance of
COPY by dropping indexes.

~~~
gaius
Most people don’t brag about being “pioneers” for having read the docs for a
basic function. That’s what’s causing all the mirth.

------
beached_whale
It looks like the article is directed at those doing bulk uploads to their
system, and not necessarily at developers. This may not be universally
intuitive to people outside of database admins and developers.

------
Theodores
They should have stuck with bog-standard MySQL to get this time saving for
free - if you restore a SQL dump created with phpMyAdmin or mysqldump then all
the disable-index commands are already in there, good to go, and in SQL.

Whoever wrote the Django bit didn't really do a good job on the defaults.

~~~
ketralnis
I don't know what makes mysql more "bog standard" than postgres, but
pg_dump takes `-F format`, one of the formats being plain SQL statements.

But that's not what they're dealing with. They're dealing with CSV, presumably
from some external source. It'd also be faster if they were dealing with pre-
formed database files that they could just rsync. But they're not.

------
keredson
yep

------
j_s
Yikes [https://www.xkcd.com/1053](https://www.xkcd.com/1053)

