PostgreSQL Indexes: First principles (eftimov.net)
240 points by craigkerstiens on Feb 26, 2016 | 37 comments

I've come to believe that designing systems like SQL to have queries that work transparently whether or not an index exists is wrong. There may be people that don't care how their query is actually executed, but I doubt it's the common case.

So many times I've fought or watched other developers fight with getting a DBMS to use an index. Or been surprised to find out that production doesn't have that index but staging does. Or found out that someone added a lower() to a query that made the index stop working. Or found out that index turns out to be more expensive than scanning the table directly in a circumstance that is different between instances of the same software.

Contrast that to an API like the Cassandra Thrift interface. You know if you're doing something expensive because the API only exposes the cheap things. It doesn't have an implicit sequential scan. You know you're doing one because you had to type out `while(next=...)` right in your code.

Something as simple as

    SELECT mycolumn FROM mytable WHERE INDEXED b=5

    SELECT mycolumn FROM mytable WHERE SCAN b=5
and refusing to run if the assertion isn't met would go a long way toward making the API predictable.
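You can clumsily approximate this in Postgres today by asserting on the plan itself. A sketch, with hypothetical table and index names, that a test suite or migration check could run before a query ships:

```sql
-- EXPLAIN (FORMAT JSON) exposes the planner's chosen "Node Type",
-- which a CI check can inspect programmatically.
EXPLAIN (FORMAT JSON)
SELECT mycolumn FROM mytable WHERE b = 5;
-- If the top plan node is "Seq Scan" where you expected "Index Scan",
-- fail the build instead of finding out in production.
```

It's not the same as a first-class WHERE INDEXED assertion, since the plan can change as statistics drift, but it catches the "staging has the index, production doesn't" case.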

> There may be people that don't care how their query is actually executed, but I doubt it's the common case.

I bet it is pretty common. Personally, as long as the query is fast enough for my needs I don't care how it executes. I only start caring when performance sucks. I wouldn't be surprised if many other developers have a similar opinion.

Indeed, this is why the trend has been away from the pure "nosql" solutions of the late 2000s and back toward SQL-oriented systems.

And not just in transactional systems but also batch processing, e.g. why would I write mapreduce code if I can just use a system like Hive that gives me a nice SQL-based interface to my data?

Confirming. The vast majority of SQL isn't written by developers or engineers to begin with; it's not even close. I administer a few DBs almost entirely queried by other departments, and they'll ask for performance help (which, to date, has always consisted of eyeballing EXPLAIN output and adding an index) when something gets too slow. This happens maybe once every quarter.

I think that depends. I was on a project where we absolutely as developers had to care about DB performance. We even wrote a tool for Sybase that monitored the query plans of long running queries. We discovered that the optimizer was really poor on Sybase 11.X and we needed to force indexes a lot.

Once you learn that DB performance is not a gradual thing, you get into the groove of it. It goes from good to horrible once you hit that magic memory / table size.

Truthfully, if your developers are the ones writing tools to figure out real-time performance, you probably want to start thinking about what's going on. (obviously this does not apply to the parent poster)

> There may be people that don't care how their query is actually executed, but I doubt it's the common case.

I'm 100% sure it's common.

> Or been surprised to find out that production doesn't have that index but staging does.

That's the same problem as having different application code on staging than what gets deployed to production; this isn't a problem with queries working independent of indexes, it's a problem of broken release processes, which requiring more ceremony to make queries work will not fix.

> > There may be people that don't care how their query is actually executed, but I doubt it's the common case.

> I'm 100% sure it's common.

The nice thing about the current state of affairs is that you can first build your queries / views until you get the results you need, then look what indices you can add to improve the performance of those queries. And if you get the results you need and find that performance is acceptable, you can just omit (or at least delay) the final step.

Most databases do have a way to force the index. See for example the MySQL way of doing it: https://dev.mysql.com/doc/refman/5.5/en/index-hints.html . You probably already know this but I mention it in case someone else does not.
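For reference, the hint syntax looks roughly like this (table and index names here are made up):

```sql
-- MySQL index hint: FORCE INDEX restricts the optimizer's choices
-- to the named index(es) for row lookup.
SELECT mycolumn
FROM mytable FORCE INDEX (idx_mytable_b)
WHERE b = 5;
```

Note it's a hint about which index to prefer, not an assertion that one must be used.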

Does that actually fail on an index missing? The docs for FORCE INDEX say "a table scan is used only if there is no way to use one of the named indexes to find rows in the table", which to me implies it's more like FORCEFULLY SUGGEST.

If your force index names a non-existent index, you'll get an error.

> Sequential isn’t Always the Worst

Indeed! And sometimes it gets even more nuanced: say the way the index(es) would be used implies a lot of disk seeking (or page faults, if the index isn't entirely in memory; you can "pre-warm" it into memory, but if you have to do that, you may want to go back and re-assess your schema and index design anyway). Sometimes a bulldozer-like, (hopefully) mostly sequential disk read works better.

The latter becomes tricky with say SSDs - you then may need to inform Postgres of differing costs for (e.g.) disk seeks (or memory reads vs. disk reads) so it can better decide which way to go.

It's also tricky with a COW filesystem like ZFS in which the order of data on-disk may not match the logical order of the file. So your database may decide to dispatch a bunch of sequential 1MB reads to the filesystem, only for the filesystem to translate those large sequential reads into many small random 8kB reads. It depends on how fragmented the file is, so you can start off with great performance which degrades significantly over time.

Luckily SSDs have a much lower penalty for random reads compared to spinning disks, so in future this will be less of an issue.

In the meantime all you can do is use a different filesystem, hide the problem with cache, or manually defragment your data files from time to time.

Actually, SSDs themselves implement a COW scheme internally, since NAND cannot be written without an erase step.

That's true but not particularly relevant to database performance. As I mentioned, random reads on an SSD do not suffer from a head seek penalty so it doesn't matter how the blocks are allocated. On the other hand with spinning disk there is a huge difference in throughput for large sequential vs small random reads.

What about TRIM?

TRIM is like the free() equivalent to malloc(). It releases a NAND page back to the free state for other uses. It doesn't require any NAND access or writes per se, but if persistent TRIM is needed (vs. best-effort), it gets much more complicated.

However, mostly it is.

Consider the example with the IDs: if he had an index containing all the fields he needs (say he only needs the id and the email), he could create an index on those columns and the query would be faster, since it would be an index-only scan (https://wiki.postgresql.org/wiki/Index-only_scans).
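A sketch of the idea, with hypothetical table and index names:

```sql
-- Covering index: both columns the query touches live in the index,
-- so the heap never has to be visited.
CREATE INDEX idx_users_id_email ON users (id, email);

EXPLAIN SELECT id, email FROM users WHERE id = 10;
-- Should show "Index Only Scan using idx_users_id_email" once the
-- visibility map is current (run VACUUM on the table if it isn't).
```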

Sure, I guess with index-only scans the picture does shift a bit, right? And quite a few cases are reducible to those, so this can't be ignored.

This is a good point, what the article really should be discussing is the tradeoff between sequential throughput and random disk access, which is especially important on HDDs, for those unfortunate enough to still be working with things that spin :)

True - that's what I had to battle with as well - disk seek cost vs. read speeds - Postgres does collect stats on that IIRC (in special tables), but might still need to advise it on that. And do lots of EXPLAIN ANALYZE (incl. BUFFERS sometimes) to check what happens. Re. buffers etc., I also had this nagging idea that the way Postgres sorts sets sometimes takes into account cache misses (i.e. it may use sort algo which compares elements that are not that far away - but not sure on this - also, perhaps this is to be expected anyway).

AFAICT pgsql controls this with the random_page_cost and seq_page_cost variables [1], which are relative and seem to default to 4 and 1 respectively. It doesn't look like these are automatically adjusted though based on the hardware.

Some searching and I found some blog posts suggesting lowering them both to closer to one on SSDs [2]. Indeed, the documentation says you can set them both to 1 if you are sure that the database will be fully cached in RAM.

[1] http://www.postgresql.org/docs/current/static/runtime-config... [2] http://www.cybertec.at/2013/01/better-postgresql-performance...
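The knobs themselves look like this; the GUC names are real, the values are the commonly suggested SSD settings from those posts, not gospel:

```sql
-- Session-level experiment first, so nothing is changed globally.
SET random_page_cost = 1.1;
SET seq_page_cost = 1.0;

-- Re-run the query to see whether the planner now picks the index
-- (hypothetical table/column):
EXPLAIN ANALYZE SELECT mycolumn FROM mytable WHERE b = 5;

-- Persist only once you've measured:
-- ALTER SYSTEM SET random_page_cost = 1.1;
-- SELECT pg_reload_conf();
```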

Aha, yeah I recall reading suggestions akin to those for SSDs. Nice!

There's also `ALTER TABLE SET STATISTICS` and default_statistics_target config parameter (for supposedly smart query planning on the fly?) (though would have to re-read to recall how these things work - off to bed soon): http://www.postgresql.org/docs/current/static/runtime-config... (edit ah you included the same link - lots of good stuff here - ahh I like Postgres documentation)
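Usage looks roughly like this (hypothetical table/column; 100 is the usual default_statistics_target):

```sql
-- Collect a larger histogram / MCV list for one skewed column,
-- so the planner's selectivity estimates improve.
ALTER TABLE mytable ALTER COLUMN b SET STATISTICS 500;
ANALYZE mytable;  -- stats only refresh on the next ANALYZE
```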

Actually, further to your point, the documentation states this explanation of the default for random/seq_page_cost:

"The default value can be thought of as modeling random access as 40 times slower than sequential, while expecting 90% of random reads to be cached

If you believe a 90% cache rate is an incorrect assumption for your workload, you can increase random_page_cost to better reflect the true cost of random storage reads. Correspondingly, if your data is likely to be completely in cache, such as when the database is smaller than the total server memory, decreasing random_page_cost can be appropriate. Storage that has a low random read cost relative to sequential, e.g. solid-state drives, might also be better modeled with a lower value for random_page_cost."

In other words, cache misses are taken into account to the extent that the model above is reflective of your workload and hw environment.

> In other words, cache misses are taken into account to the extent that the model above is reflective of your workload and hw environment.

Fair enough. Though I think they have in mind cache misses in the sense of page faults here, i.e. "it's not in memory, have to read from disk" (cache as in Postgres's internal cache, plus the OS filesystem cache IIRC). The variable name `random_page_cost` seems to suggest this (load a page into memory from disk).

Whereas I had in mind cache as in CPU cache[1], where a sort operation can incur lots of cache misses (between CPU cache and RAM) and has to read from the rest of RAM (i.e. random memory access is not uniform in cost either). It's a nice rabbit hole.

[1]: https://en.wikipedia.org/wiki/CPU_cache

> If you believe a 90% cache rate is an incorrect assumption for your workload

For people wondering, as I did, about how you might monitor for this: it looks like the data you need is in pg_statio_all_indexes.
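A sketch of the kind of query you'd run against those counters (view and column names are from the statio family of statistics views):

```sql
-- Rough per-index cache hit rate: blocks found in shared buffers
-- vs. blocks that had to be read from the OS/disk.
SELECT indexrelname,
       idx_blks_hit,
       idx_blks_read,
       round(idx_blks_hit::numeric
             / nullif(idx_blks_hit + idx_blks_read, 0), 3) AS hit_rate
FROM pg_statio_user_indexes
ORDER BY idx_blks_read DESC;
```

Note this doesn't see the OS filesystem cache, so the true hit rate is somewhat higher than what these counters show.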

Index scans imply extra disk seeks to fetch the index. A seq scan is like Pacman chewing through a row of food; an index scan is like a pied piper dancing while he plays his flute, each time he hits the ground with his foot triggering a new disk seek. The seq scan is way faster for pure bandwidth.

If you're finding the blog post with id 10, then an index is preferable because your selectivity is low. The per-row cost is going to be higher than a seq scan, but since you can exclude most of the rows it can still be better.

If you're doing analytics, you often actually want to do as many seq scans as possible. The per-row cost of sequential access on disk is so much better: if you know you're going to want to read most rows, it's a mistake to force an index scan. It depends on the selectivity of the query.

However, index scans aren't the only way you could do low-selectivity queries. Amazon Redshift doesn't support index scans at all, but it keeps the data sorted on disk which often allows it to skip blocks as needed. The whole database is just seq scans and you have to spend a lot of effort designing your table structure to support it, but it's still blindingly fast when everything aligns. People don't use Redshift to grab 1 blog post at a time from the database, though.
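The Redshift approach looks roughly like this in DDL (hypothetical table; SORTKEY is the real keyword):

```sql
-- SORTKEY keeps rows physically ordered on disk, so range predicates
-- on event_time can skip whole blocks via per-block min/max metadata.
CREATE TABLE events (
    event_time  timestamp,
    user_id     bigint,
    payload     varchar(1000)
)
SORTKEY (event_time);
-- A query like WHERE event_time > '2016-01-01' then seq-scans only
-- the blocks whose min/max range can qualify.
```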

> Index scans imply extra disk seeks to fetch the index.

This is no longer true in this generality, because since PostgreSQL 9.2 there are also index-only scans. Those scans take the data directly from the index, without going through the original rows at all. Of course, this works only under certain conditions. From the index access method documentation:



"If the index stores the original indexed data values (and not some lossy representation of them), it is useful to support index-only scans, in which the index returns the actual data not just the TID of the heap tuple. This will only work if the visibility map shows that the TID is on an all-visible page; else the heap tuple must be visited anyway to check MVCC visibility. But that is no concern of the access method's."

Despite the title saying PostgreSQL, this doesn't go into any PostgreSQL-specific topics like expression (function-based) indexes or partial indexes.
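For anyone curious, both take one line each (hypothetical tables and columns):

```sql
-- Expression index: makes the lower() predicate mentioned upthread
-- indexable, instead of forcing a seq scan.
CREATE INDEX idx_users_email_lower ON users (lower(email));
-- Matched by: WHERE lower(email) = 'alice@example.com'

-- Partial index: covers only the rows hot queries actually touch,
-- keeping the index small.
CREATE INDEX idx_orders_open ON orders (created_at) WHERE status = 'open';
```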

The stuff here is universal and applies to any database (perhaps with some terminology change).

A simple demonstration of spatial indexes used with WKT/WKB or various shape and geometry column types in GIS databases would be helpful in understanding how an index is used in a typical spatial query (e.g. distance, contains/bounds, or intersects).

There are also video and image (e.g. histogram) indexes in Postgres as well, but they're much rarer.

The article doesn't mention it when talking about clustered indexes, but you can force a table to be reordered on disk to match an index using CLUSTER.
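The command itself, with hypothetical table and index names:

```sql
-- One-time physical reorder of the heap to match the index; takes an
-- ACCESS EXCLUSIVE lock on the table while it runs.
CLUSTER mytable USING idx_mytable_b;
-- The ordering is not maintained on future writes; re-run periodically
-- (or just "CLUSTER mytable;" to reuse the last index) if needed.
```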


Unfortunately this is not the same as a clustered index as seen in other DBs like MS SQL Server.

With a true clustered index, the table IS the index; this saves space and time by not having to maintain a secondary index and then look up the result in the table. This can be very important, especially with large datasets.

Really wish true clustered indexes were implemented in PG. Kind of odd the article doesn't mention this while starting out explaining clustered vs. non-clustered.

Indeed, the issue that trips up many of the people I've worked with is that CLUSTER will make a table seem like the clustered index of other systems (in as much as it physically orders the table), but without an appropriately large fillfactor (and sometimes not even then), new rows and updates are not stored in the index's physical ordering.

So at first glance it's super-speedy for clustered index-like queries, but then becomes slower and slower over time as the data changes and consecutive index reads are no longer sequential on disk.
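The fillfactor mitigation mentioned above looks like this (hypothetical names; the value is workload-dependent):

```sql
-- Leave headroom on each page so updates can stay on the same page,
-- slowing the drift away from the clustered order.
ALTER TABLE mytable SET (fillfactor = 80);
-- Applies to newly written pages; rewrite existing ones to pick it up:
CLUSTER mytable USING idx_mytable_b;
```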

Not Postgres, but I also found this similarly (if not more) enlightening while I was building my first relational DB using SQLite: http://use-the-index-luke.com/sql/anatomy

Hey, thanks for sharing. For more principles on indexing, especially for MySQL, see this slideshare presentation: http://www.slideshare.net/hemant19cse/improving-query-perfor...

There is a specific website explaining indexing basics and different usage techniques - http://use-the-index-luke.com

I really wish there was a pgsql oriented translation of Use the Index, Luke. It seems fairly Oracle heavy, and I seem to recall the mysql stuff being explained in terms of obscure oracle features.

The world needs an integration of EXPLAIN + block io profiling.
