
Some SQL Tricks of an Application DBA - samokhvalov
https://hakibenita.com/sql-tricks-application-dba
======
munk-a
All of those tips and no mention of MATERIALIZED VIEWS[1]! It's a pretty
wonderful tool for data caching (potentially highly safe if you use on-action,
trigger-based refreshing) that lets you keep a nicely normalized data layout to
write into and a beautiful de-normalized, cross-table, index-supporting place
to read from. If you've never heard of mat views, please do look them up[2] and
play around a bit.

1\. As they're called in postgres at least

2\. [https://www.postgresql.org/docs/current/sql-creatematerializ...](https://www.postgresql.org/docs/current/sql-creatematerializedview.html)
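
A minimal Postgres sketch of that write-normalized / read-denormalized split (table and view names are made up for illustration):

    CREATE MATERIALIZED VIEW order_totals AS
        SELECT user_id, count(*) AS order_count, sum(total) AS lifetime_total
        FROM orders
        GROUP BY user_id;

    CREATE UNIQUE INDEX ON order_totals (user_id);

    -- refresh on a schedule or from a trigger/queue on the source tables;
    -- CONCURRENTLY needs the unique index above
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;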

~~~
gerhardi
Materialized views are also nice for many modern DW applications where you
don't live in a legacy "once an hour/day/week" ELT/ETL environment, but more in
a real-time-ish or micro-batch world. You can have data marts / reporting
tables composed from these materialized views over many kinds of underlying DW
models. Just dropping this in to promote the Materialized View functionality :)

~~~
chishaku
dbt (data build tool) is awesome for managing this.

~~~
aargh_aargh
[https://www.getdbt.com/](https://www.getdbt.com/)

------
gav
I'm not sure if it's just me, but when I see DELETEs that use CTEs like that I
start to get nervous. It's easy to get wrong and it's hard to undo.

I like to have an intermediary table where I can verify what's going to be
deleted (or updated) before it runs. Even better, it lets you save the mappings
of old ids, so when you find out some downstream system was using them you can
fix that too.
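
A rough sketch of that pattern (table names and the predicate are hypothetical):

    -- stage the rows first so you can eyeball them and keep the old ids around
    CREATE TABLE users_to_delete AS
        SELECT id, email
        FROM users
        WHERE last_login < now() - interval '5 years';

    -- review users_to_delete, then:
    DELETE FROM users
    WHERE id IN (SELECT id FROM users_to_delete);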

I know it's just an example but soft deletes solve a huge range of problems.

~~~
gmfawcett
Rule of thumb for me is that every DELETE or UPDATE query should start life as
a SELECT query.

The eight stages of DELETE:

\- SELECT what you're deleting;

\- fix the mistake in the query (there's always one);

\- SELECT again;

\- BEGIN TRANSACTION;

\- DELETE;

\- SELECT to confirm the right deletion -- twice for good measure, and maybe
ROLLBACK once or twice if you're feeling twitchy;

\- Hover finger anxiously over the un-entered COMMIT command for several
seconds while resisting Dunning–Kruger effect;

\- COMMIT. :)
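
Spelled out, it's something like this (the predicate is whatever you settled on in the SELECT stages):

    SELECT * FROM orders WHERE status = 'draft' AND created_at < '2019-01-01';

    BEGIN;
    DELETE FROM orders WHERE status = 'draft' AND created_at < '2019-01-01';
    SELECT count(*) FROM orders
    WHERE status = 'draft' AND created_at < '2019-01-01';  -- expect 0
    COMMIT;  -- or ROLLBACK if anything looks off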

~~~
benjohnson
Here's my insanity:

Give the table a DeletedTimestamp and a DeletedReason column.

Mark the records as deleted.

Run around like a lunatic and fix queries so they don't show "deleted" rows to
the rest of the app.

Make a page in the application where people can "restore" deleted items.
Because they do stupid things too once in a while.

~~~
niccl
My fix for this is to rename the table to something like _b_<table_name>, then
create a view <table_name> that excludes the 'deleted' rows (based on the value
in the DeletedReason column). That way you make one change to the structure
when you create the column, and everything else Just Works.
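
Roughly, in Postgres terms (the widget table is illustrative):

    ALTER TABLE widget RENAME TO _b_widget;

    CREATE VIEW widget AS
        SELECT * FROM _b_widget
        WHERE DeletedReason IS NULL;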

~~~
munk-a
We use the same approach and specifically use a "...withdeleted" suffix for the
table - you've got a table containing widget objects there? How about a
"widgetwithdeleted" table.

------
ak39
Please add to this otherwise excellent list:

LATERAL JOINs (PGSQL). In SQL Server & Oracle they are CROSS APPLY & OUTER
APPLY.

This will change your SQL habits forever. Lateral joins give you the power of
iteration in set-based operations, and they are mind-numbingly easy to
implement and understand.
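
A typical shape, sketched with made-up tables - "top 3 orders per user", where the inner query gets to reference the current row of the outer one:

    SELECT u.id, o.total
    FROM users u
    CROSS JOIN LATERAL (
        SELECT total
        FROM orders
        WHERE orders.user_id = u.id
        ORDER BY total DESC
        LIMIT 3
    ) o;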

~~~
rrrrrrrrrrrryan
I think this is really only helpful in terms of readability if you come from a
world where iteration is the norm; it really hurts readability for most
database folks, for whom set-based operations are the norm.

~~~
sbuttgereit
Long-time database person here (~20 yrs), probably very much akin to the
"Application DBA" in the article.

They solve some classes of performance-related issues, especially if you're
joining to certain kinds of group-by sub-queries or need to join on the output
of a function. If your surrounding query is defining the constraints, you can
push those into otherwise unconstrained sub-queries.

There was a minor learning curve; it took me a couple of hours for it to be
worked into my default mindset. In the end, it's a wonderful technique that can
save dropping into procedural code in any number of cases. The biggest danger
is that it is a kind of iteration: where performance can be greatly enhanced
in, say, OLTP-style queries, it may not be a net gain in larger unconstrained
queries that cause that sub-query or function to get hit many times... I think
getting that into your head is the larger part, and the declarative nature of
SQL will cause that to be a little outside of your vision no matter your
background.

------
ZinniaZirconium
Things I learned while making a database of songs on Jango Radio:

1\. Create covering indexes sorted by each column to be searched.

2\. Enable sqlite_stat4.

3\. Use WAL mode.

4\. Set a busy_timeout.

5\. Vacuum and analyze daily.

6\. Nobody has heard of Jango.
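
For 3-5, that's roughly the following (2 needs a build compiled with SQLITE_ENABLE_STAT4):

    PRAGMA journal_mode = WAL;     -- 3. write-ahead log for better read/write concurrency
    PRAGMA busy_timeout = 5000;    -- 4. wait up to 5s on a locked database instead of erroring
    ANALYZE;                       -- 5. refresh planner statistics
    VACUUM;                        -- 5. reclaim space and defragment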

~~~
munk-a
Covering indexes are something to investigate really carefully; sometimes
you'll get big wins out of them, but often those indexes can easily be
misaligned by really minor application changes, so it's something I tend to
approach only in very limited situations.

Vacuum & analyze are two fun ones, and I think postgres in particular would do
well by investing a bit more into the default configuration of autovacuum,
since the out-of-the-box settings are pretty conservative compared to the state
most out-of-the-box DBs are used in - but I think that DB configuration is very
much an unsolved problem in general and can be a quite surprising source of
major performance gains (or losses!).

I mentioned mat views elsewhere - but if you are really hitting quite
predictable usage patterns (where covering indexes are stable enough) then some
well-defined mat views might go a long way to tuning performance that little
bit more. Just be aware that rematerialization isn't instantaneous, and if you
have a high rate of change in the data then you'll probably want time-based
re-materialization instead of trigger-based, which means you'll need to account
for some inconsistent data (as is the case with caching in general).

~~~
ZinniaZirconium
I wish there were different recommended configurations for SQLite when used as
the backend of a web site versus the storage format for an embedded system.
The defaults are really not suitable for concurrency or for tables with
millions of rows. Also I wish I didn't have to recompile to enable
sqlite_stat4 histograms.

SQLite doesn't have materialized views.

------
simonw
This post was great - I've been using PostgreSQL for different projects for
fifteen years and there were a bunch of neat tricks in here that I haven't
seen before. Thanks very much.

------
btilly
Normally I read these and expect to learn nothing applicable. But the UNLOGGED
table tip was exactly what I needed for the ETL project that I'm working on.

Thanks!
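
For anyone else who missed it, the gist (the staging table here is hypothetical):

    -- skips WAL: much faster to bulk load, but the table is truncated on
    -- crash recovery and isn't replicated
    CREATE UNLOGGED TABLE staging_orders (LIKE orders INCLUDING DEFAULTS);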

------
ca_parody
I would also add, for data-science-esque tasks (or bulk queries in general):
COPY {} TO [0] is unreasonably fast - often much faster than executing the
standard SELECT (especially if that SELECT is being executed by a driver in a
slower language).

[0] [https://www.postgresql.org/docs/9.2/sql-copy.html](https://www.postgresql.org/docs/9.2/sql-copy.html)
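
From psql, for example (the query and filename are illustrative):

    \copy (SELECT id, email FROM users WHERE active) TO 'active_users.csv' WITH (FORMAT csv, HEADER)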

~~~
haki
See this

[https://hakibenita.com/fast-load-data-python-postgresql](https://hakibenita.com/fast-load-data-python-postgresql)

------
Izkata
Hot damn, looks like CLUSTER is exactly what I need to handle a pending
performance problem! I'd searched for this before, but not knowing the
keywords just ended up with people saying "add an index", which was not
useful:

We have a table of denormalized data that's used as one of the inputs to an
ETL system, and it has to be sorted by one of the columns in the table. This
is the table's only purpose. Unfortunately, postgres refuses to use the index
because random-access across the entire table is slower than loading and
sorting it all, when retrieving all data in the table. This means the ETL
system has to wait for about 15 minutes before any data is returned.

Some preliminary tests on a copy of the database indicate that, using CLUSTER
to order the data, it will do an index scan instead of the table scan+sort, and
start returning data immediately. As long as the correlation remains high
enough with regular updates, I'm thinking maybe this can be a scheduled
once-a-week operation...
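
For reference, the scheduled job would be roughly this (names made up):

    CLUSTER denormalized_input USING denormalized_input_sort_col_idx;
    ANALYZE denormalized_input;  -- refresh the correlation stats so the planner notices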

~~~
durkie
I highly recommend checking out the pg_repack extension for this situation.
CLUSTER requires an ACCESS EXCLUSIVE lock, preventing anything from reading or
writing the table during the cluster operation. pg_repack works around this by
creating a new table with your data, clustering it, recording/replaying all of
the transactions that happened on the original table during that time, and then
renaming everything (something along those lines). I saw a very noticeable
speed improvement and reduction in read load when I did it last.

~~~
Izkata
In my tests earlier, CLUSTER only took about 5 minutes (8 if you include
creating the index (which would already exist) and an ANALYZE afterwards), and
the only things inserting into this table are background celery processes. It
would also be offset from when ETL runs by half a day, so I don't think
there'll be any performance problems.

My goal for now is minimal effort to set up, so it takes minimal maintenance
later on, but I'll keep pg_repack in mind in case something weird does happen.

------
plasma
A few other tips:

1) A reminder that when you have a foreign key - say an Order table has a
UserId column, and UserId has a FK to User.Id - there is an implicit lookup
that will happen on the Order table if you attempt to delete a User, to
validate that its deletion won't violate the Order.UserId FK.

For example: DELETE FROM User WHERE Id = 1

This will require the database to first check that it won't violate a FK on
Order.UserId:

    -- Implicitly (something like this) is done to validate UserId = 1 is not
    -- in use on Order.UserId, which has a FK to User:
    SELECT Count(*) FROM Order WHERE UserId = 1

This can lead to poor update/delete performance on the User table (if a row is
deleted or a User.Id is ever changed - a convoluted example), unless you
remember to place an index on the Order.UserId column to help with FK
enforcement.
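
The fix being just a plain index on the referencing column, e.g. (Postgres-flavoured; the quoting is only because ORDER is a reserved word):

    CREATE INDEX ix_order_user_id ON "Order" (UserId);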

2) If you use SQL Server, and you have a large table, and a few times a month
for some reason queries that were working fine yesterday suddenly time out but
then start to run quickly again for a while, it's probably because SQL Server
suddenly decided to recalculate statistics for its optimizer, and the query
that is timing out is the lucky one chosen to trigger that recalculation first.

The recalculation usually involves SQL Server scanning the table before
proceeding ahead with your query, and if the query times out, the recomputation
of stats is rolled back (so it keeps failing, until it eventually gets to
complete the query).

A workaround that has worked for us is to enable asynchronous statistics
updates, so that the recomputation won't block the query that happened to
trigger it in the first place:

[https://www.mssqltips.com/sqlservertip/2904/sql-servers-auto...](https://www.mssqltips.com/sqlservertip/2904/sql-servers-auto-update-statistics-async-option/)
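
If I recall correctly, it's a per-database setting, something like:

    ALTER DATABASE [YourDb] SET AUTO_UPDATE_STATISTICS_ASYNC ON;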

------
devmunchies
I was expecting a punch line on that comic strip. nothin.

Regarding the "Always Load Sorted Data" section, it says to insert the data
into the table so that the column you want to select on is already sorted,
thus having a correlation of 1.

What if I'm adding a single record to the database, how do you re-sort them
all upon insert?

~~~
gmfawcett
By "load" I'm pretty sure they mean "insert initial data into an empty table."
So a different use case.

In ETL work, for example, it's not uncommon to truncate and reload source and
staging tables on each batch run. So the tip would be helpful in those
scenarios, more than in an OLTP case (insert this record, update that one).
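
i.e. the load step itself does the ordering - something like this (the staging table and sort column are hypothetical):

    TRUNCATE staging_events;

    INSERT INTO staging_events
    SELECT * FROM raw_events
    ORDER BY created_at;  -- load pre-sorted so the column's correlation stays near 1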

------
forinti
First trick shows that update and delete should have a special syntax for when
you really mean all rows:

    
    
       UPDATE users SET email = lower(email)
    

This should return an error: "do you mean all rows?"

    
    
       UPDATE users all rows SET email = lower(email)
    

Or something like that.

~~~
kogus
I've often thought the WHERE clause should be required. You could just say
WHERE 1=1 if you really meant all rows.

~~~
Izkata
This is a settable option in mysql:
[https://dev.mysql.com/doc/refman/8.0/en/server-system-variab...](https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_sql_safe_updates)

~~~
pmiller2
That option is slightly less convenient than just requiring a WHERE or LIMIT
clause, because it specifies that the WHERE clause has to have a key
constraint. You can still do approximately the same as the 'WHERE 1=1' trick
with something like this:

    
    
           UPDATE users SET email = lower(email) LIMIT COUNT(email)
    

if you want. I kind of like that, because it requires you to be more explicit
than just putting 'WHERE 1=1'. It seems like mistakes would be fewer, too,
because putting LIMIT COUNT(email) isn't just some default boilerplate you can
get used to typing.

~~~
uhoh-itsmaciek
Doesn't that skip counting null emails (though not filtering them for
lowercasing), meaning you may not lowercase all emails if you have any null
emails?

~~~
NicoJuicy
Nice catch. I Googled it:

> Not everyone realizes this, but the COUNT function will only count the
> records where the expression is NOT NULL in COUNT(expression) . When the
> expression is a NULL value, it is not included in the COUNT calculations.

------
onnnon
Well written and informative, thank you for sharing.

------
kmoppel
Nice list, but maybe worth mentioning that the slightly dangerous "invisible
index" suggestion can in many cases be avoided with planner hints, if the
query under tuning only touches one index. So a much better approach would be
something like:

    SET enable_indexscan TO off;
    SET enable_indexonlyscan TO off;

------
arethuza
That gave me flashbacks to using a dreadful ORM years ago that was incredibly
bad at doing updates. Combine that with code generated from UML diagrams,
resulting in tables with hundreds of columns, and every time you updated a
single field _every_ column got written out...

~~~
pachico
ORMs should probably be treated as technical debt in the sense that they help
you deliver something more quickly but force you to refactor your code once
you need scalability/performance.

~~~
giovannibonetti
I would argue otherwise. The way I see it, technical debt is code that is more
complicated than necessary for a feature. On the other hand, a good ORM makes
your app code less complicated.

You only need to refactor to use raw SQL in the rare edge cases where you need
more performance. At least working with RoR I find these cases extremely rare
(less than 1% of the cases by my estimate).

~~~
mgkimsal
My 'go to' recently has been: use the basic ORM if I'm dealing with one (or
maybe a small handful) of things where I'm doing work local to the request on
those specific items.

Use _just_ a query builder that returns a raw array for times when the data is
just being shipped down as JSON to a client. The overhead of a full ORM -
converting to full objects internally, often for hundreds or thousands of
records - is not worth whatever small conveniences might be afforded by full
ORM usage.

------
say_it_as_it_is
The role "Application DBA" doesn't seem right. Is this common?

~~~
Izkata
I haven't heard it by name, but the description is something I've heard of
before - usually when app devs are complaining about only being able to
query/update the database through provided views/stored procedures, so it
becomes a pain to support new features.

I remember one story in particular where the app devs discovered one of the
stored procedures could be manipulated to run arbitrary SQL, so they started
using that instead of talking to the DBA...

~~~
mgkimsal
war story:

I worked in at least 2 different shops with the same restriction: "developers
can never touch the production database".

Actually, it's been that way in _most_ places large enough to have enough staff
to separate things out. However, in most of those places a dev or two would
still have access to it, whether for emergencies or debugging or whatnot.

In 2 places there was an enforced rule where the DBA (in both cases, one and
only one person) had the sole password/credentials for production. "Devs can't
ever be on production, they could write random SQL - everything has to be
tested". The person who said this would _routinely_ write his own sprocs on the
DB - on production only - without tests. He acted as a support person for the
owners - he'd just write sprocs and custom queries, on production only, for any
reports or data updates the owners and upper mgt wanted. But... those sprocs
were never documented, tested, or pushed back to dev, so we never knew what was
running there. If we asked for a new table, or a new index, or whatever else
that might cause some trouble for his hidden/untested code, we'd get blanket
denials: "that causes a performance bottleneck - won't do".

The "complain about the DBAs" issue is real, but it's just symptomatic of bad
culture all around. It's just as bad to have a cowboy running rampant without
any visibility into what they're doing. But sometimes (often?) DBA folks seem
to escape this scrutiny (perhaps just at smaller shops?)

------
info781
Usually this job title is development DBA or PL/SQL developer.

