
Prepared statements and their surprising performance implications - panic
https://blog.soykaf.com/post/postgresql-elixir-troubles/
======
hinkley
Prepared Statements are a place where 'correct' and 'fast' intersect in a big
way. If all databases detected when you're done with them and closed them
automatically, they'd be just about perfect. But you can't have everything, I
suppose. A few programming languages have designed or re-designed their APIs
to try to detect this at the library level, and I appreciate the efforts to do
so.

I was stunned to learn that the most popular database mapping tool in NodeJS
did not use Prepared Statements. Due to an accident of their compatibility
matrix it was going to be tricky to achieve, and so they had kicked that can
down the road several times.

But in the meantime they had done just heaps of work on input validation to
prevent SQL injection attacks. They were confident they had no attack surface.
Bunches of test cases to prove it.

It was sort of impressive but also really sad, because creating Prepared
Statements with bound variables gets all of that for free, and they're much
faster on some DBs. The main exception is for freeform text search, but we
seem to have pushed that problem to external tools anyway (I'm not sure the
two facts are unrelated...)
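The "for free" injection safety can be sketched in a few lines (using Python's stdlib sqlite3 as a stand-in for any driver with bound parameters; the table and values are made up for illustration):

```python
import sqlite3

# Bound variables treat input as data, never as SQL text, so an injection
# attempt simply matches nothing. String interpolation is what breaks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

# Attacker-controlled input that subverts a string-interpolated query.
evil = "alice' OR '1'='1"

# Bound variable: the whole string is compared against the name column,
# so nothing matches.
safe = conn.execute("SELECT * FROM users WHERE name = ?", (evil,)).fetchall()
print(safe)  # []

# Interpolated: the input rewrites the query's meaning and returns every row.
unsafe = conn.execute(f"SELECT * FROM users WHERE name = '{evil}'").fetchall()
print(unsafe)  # [('alice', 'admin')]
```

No input validation layer is needed for the bound version; the parameter never passes through the SQL parser as text.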

~~~
SigmundA
In MSSQL, prepared statements are not really used because they're unnecessary.
MSSQL hashes incoming SQL statements, looks for them in the plan cache, and
reuses plans automatically. It also looks at connection settings and does
parameter sniffing to decide whether to use existing plans.

You can still do prepared statements with MSSQL, but they're mainly there for
backward compatibility.

I was actually surprised to find recently that PG doesn't do this, only has
prepared statements, and can't share plans across connections.

~~~
mikeyjk
I'm a bit confused to read that, because whilst statements are hashed for
reuse (does this require 'optimize for ad hoc workloads'?), unless the text of
the query is an identical match it will need to compile a separate query plan
and won't reuse the existing one. This means you can end up in a situation
where many ad hoc requests cause significant memory consumption. Would you be
able to speak to why prepared statements aren't necessary in MSSQL? I'm happy
to link to some blog articles arguing in entirely the other direction.
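The failure mode described here can be sketched as a toy model (purely illustrative; a real plan cache keys on more than a hash of the raw query text):

```python
# A toy plan cache keyed by the hash of the query text: any textual
# difference, even just a different inlined literal, is a cache miss
# that compiles (and stores) a brand-new plan.
plan_cache = {}

def get_plan(sql: str) -> str:
    key = hash(sql)
    if key not in plan_cache:
        # Stand-in for the expensive compile/optimize step.
        plan_cache[key] = f"compiled plan #{len(plan_cache) + 1}"
    return plan_cache[key]

# Parameterized text is byte-identical every time: one cached plan.
get_plan("SELECT * FROM orders WHERE id = ?")
get_plan("SELECT * FROM orders WHERE id = ?")
print(len(plan_cache))  # 1

# Inlined literals make every query textually unique: one plan per value,
# which is where the memory consumption comes from.
for order_id in range(1000):
    get_plan(f"SELECT * FROM orders WHERE id = {order_id}")
print(len(plan_cache))  # 1001
```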

------
hinkley
Long ago, we brought in a consultant to tell us why we couldn't max out our
very fancy Oracle 9i database hardware. Nothing we did could get us much above
50% saturation on the disk array, CPU, or networking. We were stumped.

I don't know where Oracle is on this issue these days, but in that version
there was a query cache for solved queries. And for a query to execute, it had
to be in that cache. That dictated how many queries could be in-flight at
once. Or rather, how many _unique_ queries. If you used Prepared Statements,
it would be happy to run as many configurations in parallel as you wanted. We
were using Prepared Statements, so that shouldn't have been a problem. Right?
... Right?

And then he showed me the log of queries.

Less than 2 years earlier (oh, you can already tell this is going somewhere,
can't you), a newer team mate was trying to scrub all of the query string
arithmetic from our code. Someone had settled on some sort of Builder pattern,
but this guy was getting stuck. The guy who had instigated this had to tag out
for reasons that I cannot recall, so he asked me to sub in. I got a little
briefing, then we went through the problems and I sorted him out, and then I
very carefully explained to him what we expected of this code.

If you're gonna bind variables, bind all of them. Solving part of a
combinatorics problem isn't solving it, solving part of a SQL injection
problem isn't solving it. Bind. Every. Variable. Got it? Got it. You good? I'm
good.

And then, dear friends, I made a fateful mistake: I went back to my other
tasks and fires, and the other engineer oversaw the rest of the work and the
code review.

18+ months later, I'm staring at a bunch of nearly identical queries in the
Oracle query cache that have half a dozen bind variables and then a couple of
distinct AND clauses near the end. Plot twist: he was now my boss. The answer
to 'why is our fancy hardware slow?' was, "because of you, Mike, because of
you." Fuck a duck.

I guess maybe the reason he was so awful at giving orders is that he never
learned how to follow them.

~~~
bmm6o
I'm not sure I follow the punchline. If the queries need different where
clauses then they aren't interchangeable. Or are you saying they were
semantically the same and only syntactically different?

~~~
hinkley
dantillberg has the right of it.

We were doing some complicated fitness tests on our data to return appropriate
rows to the user.

Actual:

SELECT * FROM things WHERE ... AND columnA>? AND columnB<5;

Expected:

SELECT * FROM things WHERE ... AND columnA>? AND columnB<?;

These sorts of mistakes tend to sneak back into projects that previously had
this handled, so you do have to watch for regressions. (But that's not what
happened in this case, if anyone is wondering.)

I don't know if there are any good ways to audit for these sorts of problems
these days. At one point I joked that I was going to write my own database
library that _only_ accepted Prepared Statements and nothing else, but barring
a full SQL query parser, I'm not entirely sure how I thought I was going to
prevent someone from doing that.

~~~
toast0
You really want this to be an SQL mode that the server enforces. Give me a
query where anything that could be a parameter wasn't and I'll fail the query,
write a nasty log, and maybe disconnect your client.

I think as an opinionated library, you could build a minimal parser and fail
queries that contain values or don't meet your parsing. I'm not super familiar
with SQL these days, but I'd guess you could pick a reasonable subset and have
confidence that you'd get false positives, but no false negatives.
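A minimal version of that opinionated check might look like this (a rough sketch, not a real SQL parser; `check_query` and the regex are hypothetical, and over-strict by design — false positives are acceptable, false negatives are not):

```python
import re

# Reject any query whose text contains an inline literal that should
# have been a bound parameter.
LITERAL = re.compile(
    r"'[^']*'"           # string literal
    r"|\b\d+(\.\d+)?\b"  # numeric literal
)

def check_query(sql: str) -> None:
    if LITERAL.search(sql):
        raise ValueError(f"inline literal found, bind it instead: {sql!r}")

# Fully bound: passes.
check_query("SELECT * FROM things WHERE columnA > ? AND columnB < ?")

# Half bound, like the query from the Oracle story: rejected.
try:
    check_query("SELECT * FROM things WHERE columnA > ? AND columnB < 5")
except ValueError as e:
    print(e)
```

This would also flag legitimate constants like `LIMIT 10`, which is the "false positives, no false negatives" trade-off.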

~~~
adrianmsmith
Oracle, at least, has (or had) the problem that it can't use the values of
parameters to affect the execution plan. So sometimes it's better to write
the values into the SQL statement.

If you write "select .. where account_id=?" then the query plan will probably
be the same for any account_id, e.g. use the index, so that's fine, working as
intended.

However, if you write "... where active=?" and active can only be true or
false and there are a few thousand active entries but tens of millions of
inactive ones, the query plan might be quite different for active=true and
active=false. (Active=true would use an index, but active=false would make
more sense to just sequentially read the whole table, as most rows will be
used anyway.)

So if you have very few distinct values, the value influences the execution
plan, and the value doesn't come from an untrusted source, it can be better
not to use query parameters.

That is to say, 99% of the time you'll want to use query parameters, but
rejecting queries which don't use them isn't a solution either, as you do need
to avoid them sometimes, unfortunately.
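The skew argument can be sketched with a toy cost model (all numbers and thresholds here are made up; real planners use statistics histograms, not this rule of thumb). When the planner can see the value it can pick a plan per value; a single generic plan for `active = ?` has to commit to one of them:

```python
# Hypothetical table: 10M rows, heavily skewed on the `active` column.
TOTAL_ROWS = 10_000_000
rows_matching = {"true": 5_000, "false": 9_995_000}

def choose_plan(value: str) -> str:
    # Rough rule of thumb: an index scan only pays off when the predicate
    # is selective; otherwise reading the whole table sequentially is cheaper.
    selectivity = rows_matching[value] / TOTAL_ROWS
    return "index scan" if selectivity < 0.01 else "sequential scan"

print(choose_plan("true"))   # index scan
print(choose_plan("false"))  # sequential scan
```

With a bound parameter and a cached generic plan, whichever of these two plans gets cached is wrong for the other value.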

~~~
cneu
That's not entirely correct. A hard parse causes the optimizer to do bind
variable peeking, i.e. it chooses the execution plan based on the bind
variable value.

[https://docs.oracle.com/cd/E11882_01/server.112/e41573/optimops.htm#PFGRF94588](https://docs.oracle.com/cd/E11882_01/server.112/e41573/optimops.htm#PFGRF94588)

~~~
pif
From the same source:

 _When choosing a plan, the optimizer only peeks at the bind value during the
hard parse. This plan may not be optimal for all possible values._

------
aspett
Oh my goodness, you're my saviour. I spent a whole day last week debugging
what looked like a caching issue with Ecto + Postgres. When the application
ran a query, it'd eventually slow down to 600ms; when I ran it in psql, it was
always 10ms. My solution was to convert the query to raw SQL and run it
manually, which bypassed the plan caching Ecto does.

THANK YOU. I will absolutely be testing this tomorrow.

------
throwdbaaway
Looks like Ecto went too far by using named prepared statements by default?
It should work just fine with unnamed prepared statements:
[https://github.com/elixir-ecto/postgrex/blob/master/README.md#pgbouncer](https://github.com/elixir-ecto/postgrex/blob/master/README.md#pgbouncer)

Postgres planner is very fast, so there will be no noticeable performance
difference by having to plan the same query again and again. This is unlike,
say MSSQL, where a semi-complex query may take more than 0.5s to parse/plan,
thus requiring the rather lousy query plan cache.

~~~
anarazel
> Postgres planner is very fast, so there will be no noticeable performance
> difference by having to plan the same query again and again. This is unlike,
> say MSSQL, where a semi-complex query may take more than 0.5s to parse/plan,
> thus requiring the rather lousy query plan cache.

That's not true. For a simple read only oltp workload (postgres' pgbench -S),
the difference between prepared and non-prepared is significant:

pgbench -M simple -P1 -n -S -T 10 -j72 -c72

...

tps = 585582.911949 (excluding connections establishing)

vs

pgbench -M prepared -P1 -n -S -T 10 -j72 -c72

...

tps = 1010152.078646 (excluding connections establishing)

For more complicated queries the difference can be considerably bigger.

~~~
throwdbaaway
Thanks for the numbers. Can you run it for -M extended as well? That would be
the case when using unnamed prepared statement.

Anyway, as there is no way for a query plan cache to be one-size-fits-all, as
shown by both TFA and some comments in this HN discussion, that slight
overhead is more than acceptable IMO, especially when compared to MSSQL.

~~~
anarazel
> Thanks for the numbers. Can you run it for -M extended as well? That would
> be the case when using unnamed prepared statement.

Those are worse, due to the increased number of protocol messages
(parse/bind/execute vs exec).

tps = 491218.039129 (excluding connections establishing)

> that slight overhead is more than acceptable IMO, especially when compared
> to MSSQL.

I wouldn't call a 2x performance difference slight... That's also with a lot
of context switching overhead - if you have a client using pipelined
execution, the difference also gets bigger.

~~~
throwdbaaway
Good point about the "slight" wording, though I don't necessarily agree that
it can get much worse than 2x with an actual workload that:

- could be more I/O bound than when just doing PK lookups with pgbench

- could be getting a wrong plan from the query plan cache if named prepared
statements are used

------
tirumaraiselvan
We saw a similar order-of-magnitude performance difference at Hasura by using
GraphQL plan caching and postgres prepared statements at the same time.

This thread is quite timely, as we recently published a post about this here:
[https://hasura.io/blog/fast-graphql-execution-with-query-caching-prepared-statements#postgresql-prepared-statements](https://hasura.io/blog/fast-graphql-execution-with-query-caching-prepared-statements#postgresql-prepared-statements)

~~~
lmilcin
Making some query operation expensive (i.e. parsing a GraphQL query, parsing a
SQL statement, computing a resource with an HTTP call) in order to make it
flexible, but then caching the results after recognizing that even with a
flexible API any (sanely designed) application will only ever issue a finite
set of actual queries, is an oft-recurring pattern in software engineering.

I wonder when this kind of pattern will start being taught more extensively
than the stupid singleton, which is nothing more than a fancy name for a
global variable.

------
smoyer
TNDR (Too Narrow - Didn't Read): I have a really wide monitor and this site's
code is horrible to read as it's compressed to a very narrow strip (less than
600px) of text down the middle of the browser window. I tried ... I really did
and it seems like the content might be worth reading.

------
otbutz
I have also been struggling lately with prepared statements and obscure query
planning decisions. The developers of the mssql JDBC driver seem to be
determined to force the use of prepexec in their driver, which causes high
execution times for typical ORM-generated queries:

[https://github.com/microsoft/mssql-jdbc/issues/1196](https://github.com/microsoft/mssql-jdbc/issues/1196)

------
angry_octet
I appreciate the use of the re-worked _Pride and Prejudice_ quote, especially
as modern social networking seems just as airheaded and deadly serious as
early 19th-century polite society. We in tech are the majordomos of the
stately mansion, the scullery maids and chimney sweeps, while all look up at
Dukes Bezos, Gates, Bloomberg and Zuck.

~~~
angry_octet
Y'all are humourless philistines.

~~~
sho
> Y'all are humourless philistines.

For what do we live, but to make sport for our neighbours, and laugh at them
in our turn?

But seriously, I think the intersection of software nerds and readers of
_Pride and Prejudice_ is probably fairly small. I appreciated it too, but can
understand that 99% of people here probably have no idea what you're talking
about. For the curious, the quote in question is _It is a truth universally
acknowledged..._

------
blunte
Great insight! And this situation illustrates why it's useful to understand
things (especially how to test and explore them) a bit deeper than just the
(not)ORM.

~~~
Cthulhu_
I'm getting a feeling of dread now. I'm using an ORM at the moment (Facebook's
'ent' in Go) but I don't know what the underlying queries and performance are
like.

I guess I'll just build my application and review the queries / performance
cost further down the line. He said hopefully.

~~~
blunte
The point is that at some point, it does become necessary to learn the
complexities beneath the interface. However, you don't have to learn it until
you _have_ to learn it. Just be prepared to spend some quality time when you
reach that wall.

