The SQLite Index Suggester

yvan · on July 5, 2022

I find it interesting, as recently I was reading about how complex it would/could be to create an index suggester.

https://www.depesz.com/2021/10/22/why-is-it-hard-to-automati...

Someone · on July 5, 2022

It’s like writing a compiler or interpreter: writing one is easy; writing a good one extremely hard.

This suggester isn’t very good. It takes a single query and suggests indexes for it. A good one would take a mix of queries and suggest a set of indexes, also considering the impact on write speed of additional indexes (table updates often need to update indexes, too)

For the example in this article, if the table is large and the average number of rows with a given ‘a’ value is close to 1 or if most queries are for ‘a’ values that aren’t in the database, it may even be better to do

  CREATE INDEX x1a ON x1(a);

That gives you a smaller index, decreasing disk usage.

lfittl · on July 5, 2022

As one of the authors of the tool that was critiqued in depesz' blog post from last year, I fully agree that looking at this on a per-query basis is too simple.

Since we wrote our initial index suggestion tool for Postgres, we actually went back to the drawing board, examined the concerns brought up, and developed a new per-table Index Advisor for Postgres that we recently released [1].

The gist of it: Instead of looking at the "perfect" index for each query, its important to test out different "good enough" indexes that cover multiple queries. Additionally, as you note, the write overhead of indexes needs to be considered (both from a table writes / second approach, as well as disk space used at a given moment in time).

I think this is a fascinating field and there is lots more work to be done. I've also found the 2020 paper "Experimental Evaluation of Index Selection Algorithms" [2] pretty useful, that compares a few different approaches.

[1] https://pganalyze.com/blog/automatic-indexing-system-postgre...

[2] https://www.vldb.org/pvldb/vol13/p2382-kossmann.pdf

zimpenfish · on July 5, 2022

> This suggester isn’t very good. It takes a single query and suggests indexes for it.

The underlying API can analyse multiple queries - looks like they've only coded up the test `.expert` command for one.

From [1], "The sqlite3expert object is configured with one or more SQL statements by making one or more calls to sqlite3_expert_sql(). Each call may specify a single SQL statement, or multiple statements separated by semi-colons." then "sqlite3_expert_analyze() is called to run the analysis."

[1] https://www.sqlite.org/src/dir?ci=trunk&name=ext/expert

rileymat2 · on July 5, 2022

>It takes a single query and suggests indexes for it. A good one would take a mix of queries and suggest a set of indexes, also considering the impact on write speed of additional indexes (table updates often need to update indexes, too)

This is my pet peeve with SQL Server SSMS will give you a missing index suggestion and cost... the problem is inexperienced people will take the suggestion as is and create way too many highly specialized indexes over time.

slaymaker1907 · on July 6, 2022

Yeah, a good autoindexer might even suggest you get rid of certain indices if they don't seem very important due to the overhead you see on inserts.

stavros · on July 6, 2022

That's a great primer on performant indexes, thanks!

realPubkey · on July 5, 2022

I did something similar for a NoSQL database [1]. The biggest surprise was how much the query performance can change for an index when the data distribution changes slightly. For example using a real distribution for an 'age' field instead of just using a random number like in the test data.

[1] https://rxdb.info/query-optimizer.html

contingencies · on July 5, 2022

Sin #5 - Unrepresentative workloads: The common assumption in academic research systems is that the cluster workload is relatively homogenous. Most research evaluations measure performance by running a single job on an otherwise idle cluster. - Schwarzkopf et al, 'The seven deadly sins of cloud computing research' (2014)

.. via https://github.com/globalcitizen/taoup

jessaustin · on July 5, 2022

It seems better for birthdate to be stored in the database and age just to be calculated when needed?

realPubkey · on July 5, 2022

Yes of course you better store the birthday normally. This was just a projection of the real data set, used in my testings.

HPsquared · on July 5, 2022

The 'age' thing might be due to the difference between floats and ints?

Integer "age" has many repeats, but random floats are unique. That, or random ints might be from a large pool, again not many repeats.

krylon · on July 5, 2022

This sounds really cool!

I've sometimes wondered why server-based RDBMSs don't offer something like this. Is it too hard to implement? Or did people just not think of it? Or do they have something like this and I just never learned about it?

A321321 · on July 5, 2022

Microsoft SQL Server definitely has suggestions for missing indexes. The quality of the suggestions are debatable though

gfody · on July 5, 2022

sql server has the index tuning wizard (itwiz) that will sample your data and make suggestions, and more lately the missing index stuff as a consequence of how the execution planner is designed (iirc it tries to generate optimal plans regardless of what indexes are available and when plans are eliminated because the necessary indexes don't exist it emits records for the various "missing index" dmvs)

dmitriid · on July 5, 2022

Microsoft SQL Server query analyzer was was essential in identifying missing indexes. Wherever you saw "full table scan" on a table, you knew it was missing an index.

I don't know if it's still around, but in mid-2000s it was light years ahead of any other database.

striking · on July 5, 2022

https://explain.dalibo.com/ and https://explain.depesz.com/ are indispensable tools for visualizing Postgres' EXPLAIN output in a similar way. I will agree that the MSSQL query visualizer is more visually effective, though.

iasay · on July 5, 2022

Indeed. It's even more fun when it starts generating really bad execution plans for table statistics it has completely got wrong, almost always while I'm eating my lunch.

krylon · on July 5, 2022

Thank you!

pronoiac · on July 5, 2022

You'd probably enjoy this other discussion on the front page: https://news.ycombinator.com/item?id=31990836

abraae · on July 5, 2022

Oracle (and likely others) have had this since well into the last century. As you might expect since it can save a huge amount of money on a large database, it's part of an (expensive) add on.

SomeOther · on July 6, 2022

Oracle can also do it autonomously if you have the super duper expensive package. It finds useful indexes based on what's queried regularly, tries them out behind the scenes to see if it'd improve things, and then creates them if they'd be useful.

ayende · on July 6, 2022

RavenDB has this feature, but it will also go ahead and create the relevant indexes needed on the fly.

It is able to do so without harming production performance.

See: https://ravendb.net/learn/inside-ravendb-book/reader/4.0/9-q...

hardwaresofton · on July 6, 2022

In Postgres land, hypopg:

https://github.com/HypoPG/hypopg