Some SQL Tricks of an Application DBA (hakibenita.com)
327 points by samokhvalov 16 days ago | 68 comments

All of those tips and no mention of MATERIALIZED VIEWS[1]. They're a pretty wonderful tool for data caching (with potentially very fresh data if you use trigger-based refreshing on writes) that lets you keep a nice normalized layout to write into and a beautiful denormalized, cross-table, index-friendly place to read from. If you've never heard of mat views, please do look them up[2] and play around a bit.

1. As they're called in Postgres, at least.

2. https://www.postgresql.org/docs/current/sql-creatematerializ...
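
For anyone who hasn't played with them, a minimal Postgres sketch (table and view names here are made up for illustration):

    -- Normalized table to write into
    CREATE TABLE orders (id bigint PRIMARY KEY, user_id bigint, total numeric);

    -- Denormalized snapshot to read from
    CREATE MATERIALIZED VIEW user_totals AS
        SELECT user_id, count(*) AS order_count, sum(total) AS lifetime_total
        FROM orders
        GROUP BY user_id;

    CREATE UNIQUE INDEX ON user_totals (user_id);

    -- Rebuild on whatever schedule (or trigger) fits your staleness budget;
    -- CONCURRENTLY requires the unique index but doesn't block readers.
    REFRESH MATERIALIZED VIEW CONCURRENTLY user_totals;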

Materialized views are also nice for many modern DW applications where you don't live in a legacy "once an hour/day/week" ELT/ETL environment, but more in a real-time-ish or micro-batch world. You can have data marts / reporting tables composed of these materialized views over many kinds of underlying DW models. Just dropping this to promote the materialized view functionality :)

dbt (data build tool) is awesome for managing this.

I've found them to be pretty limited as a feature; really they are just some syntactic sugar over "CREATE TABLE AS SELECT ...".

If you start to need any kind of flexibility over refreshing the table, you'll need to move to using a plain old table anyway.

This was my impression too. Without incremental update, I don't see the benefit?

I'm someone with limited DB wizardry, but a large admiration for Postgres so take this for what it's worth:

It's my understanding that there is a proposed feature, with a proof-of-concept floating around, called "Incremental View Maintenance": automatic updating of materialized views when their underlying base tables change.


  Materialized view with IVM option created by CREATE INCREMENTAL MATERIALIZED VIEW command. Note this syntax is just tentative, so it may be changed.

  When a materialized view is created, AFTER triggers are internally created on all of its base tables.

  When a base table is modified (INSERT, DELETE, UPDATE), this view is updated incrementally in the trigger function.
I think there are some performance/technical things that are getting sorted out with this, but it would be a killer feature.

It's the one thing I wish Postgres had that it doesn't. You have to use triggers to do this currently.
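
Until IVM lands, a rough sketch of the trigger-based workaround, using a plain summary table in place of the materialized view (table and function names are made up; assumes an orders(user_id) base table):

    CREATE TABLE order_counts (user_id bigint PRIMARY KEY, n bigint NOT NULL);

    CREATE FUNCTION bump_order_count() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO order_counts (user_id, n) VALUES (NEW.user_id, 1)
            ON CONFLICT (user_id) DO UPDATE SET n = order_counts.n + 1;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE order_counts SET n = n - 1 WHERE user_id = OLD.user_id;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_count_trg
        AFTER INSERT OR DELETE ON orders
        FOR EACH ROW EXECUTE FUNCTION bump_order_count();

(UPDATEs of user_id and concurrency quirks are left out; that's exactly the non-trivial part IVM would handle for you.)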

Yes, this is a killer feature. There are many instances of what are essentially application-side caches that try (and often fail) to do this; it's non-trivial to do right, and I am greatly looking forward to it.

In particular, an event-sourced world becomes much easier to implement and maintain with the DB handling all the incremental magic.

For what it's worth, SQL Server supports incremental updates for materialized views. I think they're called indexed views, though.

Yep! And they're pretty magical. Beyond materializing joins, you can use COUNT_BIG to greatly speed up common DISTINCT queries and the two row trick[1] to enforce complex constraints.

[1] https://spaghettidba.com/2011/08/03/enforcing-complex-constr...
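
For reference, the basic indexed-view shape in SQL Server (names are made up; COUNT_BIG(*) is mandatory in a grouped indexed view, which is what enables the DISTINCT-style speedups):

    CREATE VIEW dbo.OrderCounts
    WITH SCHEMABINDING
    AS
    SELECT UserId, COUNT_BIG(*) AS OrderCount
    FROM dbo.Orders
    GROUP BY UserId;
    GO

    -- The unique clustered index is what materializes the view on disk
    CREATE UNIQUE CLUSTERED INDEX IX_OrderCounts ON dbo.OrderCounts (UserId);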

Well, they are magical until you need to do something a bit more complex, like window functions, where they stop working. I've worked around this by using indexed views for sub-parts of a larger regular view, but it feels a bit hacky, since you need to hint the server to actually use the indexed view or it will just treat it as a regular view, which I don't understand the reasoning behind.

But for simple things they’re magical.

> Well they are magical until you need to do something a bit more complex, like window functions.

The best method I've personally found for window functions is cross/outer apply + narrow indexes with included columns. A lot of times you can get away without an indexed view at all.

> you need to hint the server to actually use the indexed view or it will just use it as a regular view which I don’t understand the reasoning behind

SQL Server Enterprise will use indexed views automatically. But you gotta shell out the big bucks for that improved query planner.

Please add to this otherwise excellent list:


This will change your SQL habits forever. Lateral joins give you the power of iteration in set-based operations that are mind-numbingly easy to implement and understand.
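
The classic shape is "top N per group", which is painful with plain joins. A sketch in Postgres terms (table and column names invented):

    -- For each user, run the correlated sub-query like a per-row loop:
    SELECT u.id, recent.*
    FROM users u
    LEFT JOIN LATERAL (
        SELECT o.id AS order_id, o.created_at
        FROM orders o
        WHERE o.user_id = u.id
        ORDER BY o.created_at DESC
        LIMIT 3
    ) recent ON true;

LEFT JOIN ... ON true keeps users with no orders; CROSS JOIN LATERAL would drop them.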

I think this is really only helpful in terms of readability if you come from a world where iteration is the norm, and really hurts readability for most database folks where set-based operations are the norm.

Long-time database person here (~20yrs), probably very much akin to the "Application DBA" in the article.

They solve some classes of performance-related issues, especially if you're joining to certain kinds of GROUP BY sub-queries or need to join on the output of a function. If your surrounding query defines the constraints, you can push them into otherwise unconstrained sub-queries.

There was a minor learning curve; it took me a couple hours to work it into my default mindset. In the end, it's a wonderful technique that can save you from dropping into procedural code in any number of cases. The biggest danger is that it is a kind of iteration: performance can be greatly enhanced in, say, OLTP-style queries, but it may not be a net gain in larger unconstrained queries that cause that sub-query or function to get hit many times. Getting that into your head is the larger part, and the declarative nature of SQL will keep it a little outside of your vision no matter your background.

Not really - this fine SO answer has many examples of CROSS/OUTER APPLY elegantly solving problems that would otherwise have been inconvenient to deal with - https://stackoverflow.com/a/9275865/753731

This is life-changing. Thank you so much.

LATERAL JOINs are also in Snowflake, agree that they are the jam.

I'm not sure if it's just me, but when I see DELETEs that use CTEs like that I start to get nervous. It's easy to get wrong and it's hard to undo.

I like to have an intermediary table where I can verify what's going to be deleted (or updated) before it runs. Even better, it lets you save the mappings of old ids, so when you find out some downstream system was using them, you can fix that too.

I know it's just an example but soft deletes solve a huge range of problems.

Rule of thumb for me is that every DELETE or UPDATE query should start life as a SELECT query.

The eight stages of DELETE:

- SELECT what you're deleting;

- fix the mistake in the query (there's always one);

- SELECT again;

- BEGIN a transaction;

- DELETE;

- SELECT to confirm the right deletion -- twice for good measure, and maybe ROLLBACK once or twice if you're feeling twitchy;

- Hover finger anxiously over the un-entered COMMIT command for several seconds while resisting Dunning–Kruger effect;

- COMMIT. :)
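
The whole ritual, sketched as one transaction (table and predicate invented):

    BEGIN;

    SELECT * FROM users WHERE last_login < '2015-01-01';   -- eyeball it first

    DELETE FROM users WHERE last_login < '2015-01-01';

    SELECT count(*) FROM users WHERE last_login < '2015-01-01';  -- expect 0

    -- ROLLBACK;  -- the escape hatch, while you still have one
    COMMIT;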

Here's my insanity

Give the table a DeletedTimestamp and a DeletedReason column.

Mark the records as deleted.

Run around like a lunatic and fix queries so they don't show "deleted" rows to the rest of the app.

Make a page in the application where people can "restore" deleted items. Because they do stupid things too once in a while.

My fix for this is to rename the table to something like _b_<table_name> then create a view <table_name> that excludes the 'deleted' values (based on the value in DeletedReason column). That way you make one change to the structure when you create the column, and everything else Just Works
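
In Postgres terms, roughly (names invented; a simple single-table view like this is even auto-updatable, so existing INSERTs and UPDATEs keep working):

    ALTER TABLE widgets RENAME TO _b_widgets;

    CREATE VIEW widgets AS
        SELECT * FROM _b_widgets
        WHERE DeletedReason IS NULL;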

We use the same approach, and specifically use a "...withdeleted" suffix for the table. Got a table containing widget objects? How about a "widgetwithdeleted" table.

Until you hit requirements that data must really be gone from the system.

At the least you need something like DELETE FROM foo WHERE deleted < NOW() - INTERVAL '30 days'.

Yep, I do the same. Depending on the size of the data being affected, the level of table RI in place, and the necessity of seeing the data again later (or the client calling back the following day/week, having changed their mind: can we 'undo' that?), I will often tack in a SELECT INTO tablename_changes_jan7_2020 or some such convention to grab the unchanged records. Quantity, backup retention options, etc. also play into this, so use some common sense of course. If you worry about space, just set up a DROP TABLE job to clear it after a month or whatever. Lots of these CYA tricks exist if you want to make sure you don't shoot yourself in the foot.

I do the same thing.

However, it won't save you if you realize that there's an issue after you commit. Maybe I'm just paranoid, but I like to have a snapshot before a major change, plus I'm fond of having a spare (via delayed replication or log shipping).

Ah, I forgot to include the twelve stages of proactive, redundant backups. :) I usually do the same thing -- local snapshots if the database is small and the app is less critical... or a careful review of the central db backup history and a documented rollback plan if it's a bigger system.

Large DELETEs muck up table statistics and cached join plans, which is why a lot of enterprise databases have an IsDeleted column on their large tables. Setting things up this way has the added benefit of being able to easily "delete" things without too much anxiety.

Ideally, all SELECT queries are forced through a view layer (which already has an IsDeleted = 0 restriction), and you can just schedule a nightly job which deletes or archives all rows with IsDeleted = 1, then refreshes the table statistics automatically.

It's a good idea to never run hand-crafted data mutating queries against a production DB. Nervousness is fine but CTEs are extremely useful - at our company we insist on automated tests around SQL that will be run against the prod DB and that makes me quite happy. Just make sure the devs prove their CTE is correct before letting it touch the data.

Things I learned while making a database of songs on Jango Radio:

1. Create covering indexes sorted by each column to be searched.

2. Enable sqlite_stat4

3. Use WAL mode.

4. Set a busy_timeout

5. Vacuum and analyze daily.

6. Nobody has heard of Jango.
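
Items 3 through 5 translate to roughly this (busy_timeout is in milliseconds; sqlite_stat4 additionally requires SQLite compiled with SQLITE_ENABLE_STAT4):

    PRAGMA journal_mode = WAL;   -- readers stop blocking the writer
    PRAGMA busy_timeout = 5000;  -- wait up to 5s on a locked database

    -- daily maintenance job:
    VACUUM;
    ANALYZE;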

Covering indexes are something to investigate really carefully. Sometimes you'll get big wins out of them, but often those indexes can easily be misaligned by really minor application changes, so it's something I tend to approach only in very limited situations.

Vacuum and analyze are two fun ones, and I think Postgres in particular would do well by investing a bit more in the default autovacuum configuration, since the out-of-the-box settings are pretty conservative compared to how most out-of-the-box DBs are used. But DB configuration is very much an unsolved problem in general, and can be a quite surprising source of major performance gains (or losses!).

I mentioned mat views elsewhere, but if you really do have quite predictable usage patterns (where covering indexes are stable enough), then some well-defined mat views might go a long way toward tuning performance that little bit more. Just be aware that rematerialization isn't instantaneous, and if you have a high rate of change in the data you'll probably want time-based rematerialization instead of trigger-based, which means you'll need to account for some inconsistent data (as is the case with caching in general).

I wish there were different recommended configurations for SQLite when used as the backend of a web site versus the storage format for an embedded system. The defaults are really not suitable for concurrency or for tables with millions of rows. Also I wish I didn't have to recompile to enable sqlite_stat4 histograms.

SQLite doesn't have materialized views.

Damn you work at jango? I've been a happy customer since ~2009.

It's the best kept secret on the internet. I wish you guys blogged about your engineering, I bet there's lots to say.

Keep rocking

Haha! No, I don't work at Jango. Earlier this year I got tired of songs disappearing from Jango song search, so I made my own search tool as a quarantine project. Like you say, Jango is almost a secret these days, and I've been looking for an appropriate forum to discuss it, but there pretty much isn't any. I tried Show HN and nothing happened.

So are you actually a paying customer with a Radio Airplay account or are you a free Jango listener? I never had and never needed an account.

I don't know what kind of database Jango uses or what kind of work Jango people do but I sure hope they keep doing it.

Sounds like a GIN index might be better suited for the search case.

SQLite doesn't have GIN indexes.

This post was great - I've been using PostgreSQL for different projects for fifteen years and there were a bunch of neat tricks in here that I haven't seen before. Thanks very much.

Normally I read these and expect to learn nothing applicable. But the UNLOGGED table tip was exactly what I needed for the ETL project that I'm working on.


I would also add, for data-science-esque tasks (or bulk queries in general): COPY ... TO[0] is unreasonably fast, often much faster than executing the standard SELECT (especially if that SELECT is being executed by a driver in a slower language).


[0] https://www.postgresql.org/docs/9.2/sql-copy.html
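
For example, dumping a query result as CSV (paths and names invented; the psql \copy variant runs client-side, so it needs no filesystem access on the server):

    -- Server-side:
    COPY (SELECT id, title FROM songs WHERE plays > 100)
        TO '/tmp/popular_songs.csv' WITH (FORMAT csv, HEADER);

    -- Client-side, from psql:
    -- \copy (SELECT id, title FROM songs WHERE plays > 100) TO 'popular_songs.csv' WITH (FORMAT csv, HEADER)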

Hot damn, looks like CLUSTER is exactly what I need to handle a pending performance problem! I'd searched for this before, but not knowing the keywords just ended up with people saying "add an index", which was not useful:

We have a table of denormalized data that's used as one of the inputs to an ETL system, and it has to be sorted by one of the columns in the table. This is the table's only purpose. Unfortunately, postgres refuses to use the index because random-access across the entire table is slower than loading and sorting it all, when retrieving all data in the table. This means the ETL system has to wait for about 15 minutes before any data is returned.

Some preliminary tests on a copy of the database indicate that, after using CLUSTER to order the data, it will do an index scan instead of the table scan+sort, and start returning data immediately. As long as the correlation remains high enough with regular updates, I'm thinking maybe this can be a scheduled once-a-week operation...
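
The once-a-week operation would look something like this (table and index names invented; note that CLUSTER takes an ACCESS EXCLUSIVE lock for its duration):

    -- One-off: physically rewrite the table in index order
    CLUSTER denorm_input USING denorm_input_sort_col_idx;
    ANALYZE denorm_input;  -- refresh the correlation statistic

    -- Later runs: bare CLUSTER reuses the remembered index
    CLUSTER denorm_input;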

Highly recommended to check out the pg_repack extension for this situation. CLUSTER requires an access exclusive lock, preventing anything from reading or writing the table during the cluster operation. pg_repack works around this by creating new tables with your data, clustering those, recording/replaying all of the transactions that happened on the original tables during that time, and then renaming everything (something along those lines). Very noticeable speed improvement and reduction in read loads when I did it last.

In my tests earlier, CLUSTER only took about 5 minutes (8 if you include creating the index (which would already exist) and an ANALYZE afterwards), and the only things inserting into this table are background celery processes. It would also be offset from when ETL runs by half a day, so I don't think there'll be any performance problems.

My goal for now is minimal effort to set up, so it takes minimal maintenance later on, but I'll keep pg_repack in mind in case something weird does happen.

A few other tips:

1) A reminder that when you have a foreign key, such as an Order table with a UserId column that has a FK to User.Id, there is an implicit lookup on the Order table whenever you attempt to delete a User, to validate that the deletion won't violate the Order.UserId FK.

For example: DELETE FROM User WHERE Id = 1

This will require the database to first check that won't violate a FK on Order.UserId:

-- Implicitly, something like this is done to validate that UserId = 1 is not in use on Order.UserId (which has a FK to User):
SELECT COUNT(*) FROM Order WHERE UserId = 1

This can lead to poor update/delete performance on the User table (whenever a row is deleted, or if User.Id is ever changed, a convoluted example), unless you remember to place an index on the Order.UserId column to help with FK enforcement.
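
The fix is a one-liner (quoting "Order" since it's a reserved word):

    -- Without this, every DELETE on "User" scans "Order" to enforce the FK
    CREATE INDEX IX_Order_UserId ON "Order" (UserId);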

2) If you use SQL Server, and you have a large table, and a few times a month queries that were working fine yesterday suddenly time out but then start running quickly again for a while, it's probably because SQL Server suddenly decided to recalculate statistics for its optimizer, and the query that is timing out is the lucky one chosen to trigger the recalculation.

The recalculation usually involves SQL Server scanning the table before proceeding with your query, and if the query times out, the recomputation of stats is rolled back (so it keeps failing, until it eventually gets to complete).

A workaround that has worked for us is to enable asynchronous statistics, so that the re-computation won't block the query that happened to trigger the re-computation in the first place:
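
The database-level setting being referred to is presumably:

    ALTER DATABASE CURRENT SET AUTO_UPDATE_STATISTICS_ASYNC ON;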


I was expecting a punch line on that comic strip. Nothin'.

Regarding the "Always Load Sorted Data" section, it says to insert the data into the table so that the column you want to select on is already sorted, thus having a correlation of 1.

What if I'm adding a single record to the database, how do you re-sort them all upon insert?

By "load" I'm pretty sure they mean "insert initial data into an empty table." So a different use case.

In ETL work, for example, it's not uncommon to truncate and reload source and staging tables on each batch run. So the tip would be helpful in those scenarios, more than in an OLTP case (insert this record, update that one).

First trick shows that update and delete should have a special syntax for when you really mean all rows:

   UPDATE users SET email = lower(email)
This should return an error: "do you mean all rows?"

   UPDATE users all rows SET email = lower(email)
Or something like that.

I've often thought the WHERE clause should be required. You could just say WHERE 1=1 if you really meant all rows.

That option is slightly less convenient than just requiring a WHERE or LIMIT clause, because it specifies that the WHERE clause has to have a key constraint. You can still do approximately the same as the 'WHERE 1=1' trick with something like this:

       UPDATE users SET email = lower(email) LIMIT COUNT(email)
if you want. I kind of like that, because it requires you to be more explicit than just putting 'WHERE 1=1'. It seems like mistakes would be fewer, too, because putting LIMIT COUNT(email) isn't just some default boilerplate you can get used to typing.

Doesn't that skip counting null emails (though not filtering them for lowercasing), meaning you may not lowercase all emails if you have any null emails?

Nice catch. Googled it:

> Not everyone realizes this, but the COUNT function will only count the records where the expression is NOT NULL in COUNT(expression) . When the expression is a NULL value, it is not included in the COUNT calculations.

I have seen "WHERE 1" in MySQL land. Is it the same thing?

Many IDEs do throw up an "Are you sure?" confirmation prompt by default if you try to perform a DELETE or an UPDATE without a WHERE clause. I know some even _require_ you to have a WHERE clause by default (it's a toggleable setting, obviously), which is why you'll see some database devs add "WHERE 1=1" to the end of their queries as a matter of habit.

Well written and informative, thank you for sharing.

Nice list, but maybe worth mentioning that the slightly dangerous "invisible index" suggestion can in many cases be avoided with planner settings, if the query under tuning only touches one index. So a much better approach would be something like:

SET enable_indexscan TO off; SET enable_indexonlyscan TO off;

That gave me flashbacks to using a dreadful ORM years ago that was incredibly bad at doing updates. Combine that with code generated from UML diagrams, resulting in tables with hundreds of columns, and every time you updated a single field every column got written out...

Sorry to hear about the flashbacks. Having written a couple dreadful ORMs in my day, I feel an odd kind of guilt right now. :)

I know it's a new post, but it's funny to see what a large share of this discussion is about your ORM comment. You've touched a nerve, I think! I also wonder if there should be a corollary of Godwin's law -- "as an online discussion about relational databases grows longer, the probability of diverging into a discussion about ORMs approaches 1." :)

ORMs should probably be treated as technical debt in the sense that they help you deliver something more quickly but force you to refactor your code once you need scalability/performance.

I would argue otherwise. The way I see it, technical debt is code that is more complicated than necessary for a feature. On the other hand, a good ORM makes your app code less complicated.

You only need to refactor to use raw SQL in the rare edge cases where you need more performance. At least working with RoR I find these cases extremely rare (less than 1% of the cases by my estimate).

My 'go to' recently has been: use the basic ORM if I'm dealing with one (or maybe a small handful) of things where I'm doing work local to the request on those specific items.

Use just a query builder that returns a raw array when the data is just being shipped down as JSON to a client. The overhead of a full ORM (converting to full objects internally, often for hundreds or thousands of records) is not worth whatever small conveniences full ORM usage might afford.

The role "Application DBA" doesn't seem right. Is this common?

I've always referred to it as a Development DBA. But the gist of separating DBAs into two camps I've definitely used and seen used in many places.

I sometimes attribute the modern rise of NoSQL and other databases to the somewhat bad reputation that infrastructure-only DBAs can give to the profession. If the only major thing you do for work is check that the backups ran overnight and deny change requests for "stability" reasons, it can really poison the well for others who want to make the database sing.

So what do developers do? They cut out the DBA role by inventing complicated but relatively low-maintenance DBs that don't need any DBAs at all; if you have performance issues, just spin up another node. Whereas often all they needed was a good Development/Application-oriented DBA to sort out their traditional RDBMS properly.

I haven't heard it by name, but the description is something I've heard of before - usually when app devs are complaining about only being able to query/update the database through provided views/stored procedures, so it becomes a pain to support new features.

I remember one story in particular where the app devs discovered one of the stored procedures could be manipulated to run arbitrary SQL, so they started using that instead of talking to the DBA...

war story:

I've worked in at least two different shops with the same restriction: "developers can never touch the production database."

Actually, it's been that way in most places large enough to have the staff to separate things out. However, in most of those places a dev or two would still have access, whether for emergencies or debugging or whatnot.

In two places there was an enforced rule where the DBA (in both cases, one and only one person) had the sole password/credentials for production: "Devs can't ever be on production, they could write random SQL; everything has to be tested." The person who said this would routinely write his own sprocs on the DB, on production only, without tests. He acted as a support person for the owners: he'd write sprocs and custom queries, on production only, for any reports or data updates the owners and upper management wanted. But those sprocs were never documented, tested, or pushed back to dev, so we never knew what was running there. If we asked for a new table, or a new index, or whatever that might cause trouble for his hidden/untested code, we'd get blanket denials: "that causes a performance bottleneck; won't do."

The "complain about the DBAs" issue is real, but it's just symptomatic of bad culture all around. It's just as bad to have a cowboy running rampant without any visibility into what they're doing. But sometimes (often?) DBA folks seem to escape this scrutiny (perhaps just at smaller shops?).

It's a bit of a strange one: oftentimes this sort of person is a back-end specialist and maybe part of the data architecture team, depending on the size of the company.

I'd not heard this title before, and it's misleading to me, since I strongly associate DBA with "has nothing to do with application code", so it's a bit of a contradiction to me personally.

I've not heard it before. I was an Application"s" DBA for several years, but this was specifically for Oracle's ERP suite of business applications. The role mixed straight DBA responsibilities with middleware and business administration.

Usually this job title is development DBA or PL/SQL developer.
