Why Can't Database Tables Index Themselves? (2006) (codinghorror.com)
158 points by SPBS 33 days ago | 119 comments



On Azure, SQL Database Advisor suggests (and can automatically apply, if you want) indexes based on the workload of the DB. If you decide to apply the index, it evaluates for a while whether it is working as expected and, if not, it rolls back the index creation.

https://azure.microsoft.com/it-it/blog/simplify-performance-...
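
For reference, turning this on is roughly a one-liner on Azure SQL (option names from memory; check the automatic tuning docs before relying on them):

  ALTER DATABASE CURRENT
  SET AUTOMATIC_TUNING (CREATE_INDEX = ON, DROP_INDEX = ON);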


This is a great feature. We manually added the indexes it recommended for a while but eventually found ourselves always adding the suggestions so we turned on the automatic creation.


I have only used it on Oracle 10g, but dbms_sqltune is a PL/SQL package that implements the SQL Tuning Advisor.

I wrote some PHP for my users to find their queries and run the analyzer, and it will suggest new indexes among other performance-enhancing options.

There is also this SQLite index suggestion tool on the front page today...

https://news.ycombinator.com/item?id=31991598


Some analysis of the Azure / Sql Server 2017+ automatic index tuning:

https://www.brentozar.com/archive/2019/02/what-does-azure-sq...


Azure Cosmos also has a default index which you can override for better performance.


In modern Prolog systems, this is in fact done: When looking up clauses from the database, the Prolog system builds an index as needed based on which arguments of the query are instantiated. This is called Just-In-Time (JIT) indexing.

Modern Prolog systems also perform multi-argument and deep indexing. In older Prolog systems, you sometimes had to declare explicitly on which arguments the system should index, analogous to those database systems where you have to declare indices manually.


It sometimes feels like learning Prolog completely ruined me for SQL. Doing even simple things seems ridiculously hard with SQL.


In his seminal paper A Relational Model of Data for Large Shared Data Banks, Codd explicitly mentions first-order predicate calculus as a suitable formalism for the sub-language. Quoting from https://dl.acm.org/doi/pdf/10.1145/362384.362685:

"The adoption of a relational model of data, as described above, permits the development of a universal data sub-language based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relationships is in normal form."

One great advantage of using Prolog is that queries themselves are so readily amenable to analysis and reasoning with the same mechanism, and it becomes very easy to optimize them by reordering goals.


You just piqued my interest in Prolog more than any pitch I had heard until now. Would you care to elaborate? How is it nicer than SQL?


Unlike SQL, Prolog's core language is both simple and composable: about 200 pages. SQL is just a bunch of special cases and unfortunate historical accidents.

It's a lot like... PHP 4 in comparison with Standard ML.

And, sadly, both SML and Prolog are fading away, while SQL (and PHP) are here to stay.


Doesn't SML live on in Ocaml and F# ?


These are parallel branches of the same language family, along with Haskell. They are relatively popular but I can't say they flourish.

Besides, SML always was the cleanest and most consistent of the bunch. Which kind of supports my point: it's not always the best language that wins.


For me it is the other way around.

Prolog allows effects on its database while executing the "query" - you can add or remove facts from the database. SQL prohibits that, separating queries and data manipulation statements.

And, of course there are things like list comprehension in Haskell and other languages which are more or less the same as SQL but definitely more succinct.


> SQL prohibits that, separating queries and data manipulation statements.

With the exception of the case where the set of rows you want to return (a projection of them) is exactly the same set being modified.

IIRC, you can hack around this with a JOIN in the FROM clause of the main statement, even if it is only of substantive use in the RETURNING clause, to get what amounts to arbitrary data-modifying queries.


Hm, the language does not enforce it, but you _can_ separate things neatly for sure

I may have missed something, though, are you thinking of something other than asserting and retracting?


Can you syntactically write a SELECT query that also does an UPDATE/INSERT/DELETE? And Bobby Tables doesn't count - using a ; clearly indicates the start of a new query/statement.


As I see it, the fact that this (legitimate!) question even arises is saying a lot about SQL syntax:

Can we syntactically write a SELECT query that also does an UPDATE/INSERT/DELETE? It may be possible, or it may not be, and to find out which it is, one has to have a good grasp of a quite complex syntax specification.

For comparison, in Prolog, queries have a very uniform and simple syntax, and a small set of predicates is used to modify the database. You can find out whether a query performs any of these operations by meta-interpreting the query. It is easy to write a meta-interpreter for Prolog queries, and to allow only calls of predicates that you know do not affect the database. In comparison, it is extremely hard to analyze and interpret SQL queries, also because the syntax of SQL is so complex that we cannot easily answer such basic questions about the syntax.


The meta-interpretation will not say anything more than "this query may modify the fact database". Anything beyond that requires actual interpretation with rollback of changes, because of the halting problem.

Last time I checked, rollback of changes made to the fact database was not even a notion in the Prolog language. There was no notion of a transaction in Prolog.

"meta-interpretation" of SQL statement amounts to looking at the statement type, whether it is an INSERT/DELETE/UPDATE or just a SELECT query.

There is not much complexity in SQL syntax. My first experience writing a more or less non-toy parser, 20 years ago, was parsing an SQL schema. SQL is not harder than VHDL with its context-sensitive lexing phase, where you can't tell whether '(' is a character literal or part of a record type value construction. Maybe DELIMITER can bite you, but again, it is not hard to parse with parser combinators.


No, but you can write an UPDATE which uses a SELECT in all of SET, FROM, and WHERE, which broadly speaking will do the same thing.

The difficulty in figuring out all this syntactic weirdness is probably overstated for small queries, but it's there. Most attempts to write a better SQL I've seen succeed in narrow terms, but if the goal is to replace SQL rather than write a better syntax which (almost) no one uses, that's not enough.
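
A rough sketch of what that looks like (Postgres-style UPDATE ... FROM; all table and column names here are made up):

  UPDATE accounts a
  SET    balance = s.total                                   -- SELECT feeding SET
  FROM   (SELECT account_id, SUM(amount) AS total
          FROM ledger GROUP BY account_id) s                 -- SELECT in FROM
  WHERE  a.id = s.account_id
  AND    a.id IN (SELECT account_id FROM flagged_accounts);  -- SELECT in WHERE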


That's what I had always assumed - i.e. if the first keyword is SELECT it can't possibly modify the data, which does match the OP's claim that SQL separates queries from data manipulation commands.


I see a small impedance mismatch between "SQL prohibits updating in queries" and "SQL requires you to write query updates a certain way", though I see where OP is coming from. One of the reasons mini/microKanren has taken off in research is the use of immutable data structures to prevent the morass you can get with self-modifying Prolog queries.


Except that all DML commands can also be queries by way of RETURNING.


SELECT drop_table_procedure() ;


Ah, yep, good one, definitely see a danger there.


> Can you syntactically write a SELECT query that also does an UPDATE/INSERT/DELETE?

No, but you can write an UPDATE/INSERT/DELETE that returns a result set like a SELECT.


> SQL prohibits that, separating queries and data manipulation statements.

Under SQL Server, you can attach an OUTPUT clause to INSERT/UPDATE/DELETE/MERGE and effectively have a query that both modifies and returns data. Other DBMSes have similar capabilities.
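
For instance, something along these lines in T-SQL (hypothetical table), which deletes rows and returns them like a result set in one statement:

  DELETE FROM Orders
  OUTPUT deleted.OrderID, deleted.Total   -- returned to the client like a SELECT
  WHERE  OrderDate < '2020-01-01';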


Yes, they do. So what? They are essentially equivalent to two separate statements, a query and a DML statement, executed inside a transaction.

Meanwhile Prolog does not even have a notion of transactions.


I've only ever seen/used Prolog with a fixed set of facts (axioms? I forget what they're called). Are there modern Prolog systems with on-disk storage that could replace a DBMS?


The SQLite CLI has the .expert mode (marked as experimental) [0]. Turning it on will suggest indexes for the queries executed.

[0]: https://www.sqlite.org/cli.html#index_recommendations_sqlite...
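
Roughly, the interaction looks like this (the table, column, and exact output format here are made up from memory; the real tool prints a suggested CREATE INDEX plus the query plan instead of query results):

  sqlite> .expert
  sqlite> SELECT * FROM users WHERE email = ? ORDER BY created_at;
  CREATE INDEX users_idx_1 ON users(email, created_at);

  SEARCH users USING INDEX users_idx_1 (email=?)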


The obvious answer is that there are certain queries that cannot be made fast with normal indices. What index would you suggest for these queries?

  SELECT * FROM Foo WHERE Bar != 42;
  SELECT * FROM Baz WHERE x > 42 ORDER BY y DESC LIMIT 10
The usual solution is to constrain the query language in some way to prevent users from issuing queries that can't be satisfied efficiently using an index. If you are disciplined enough to do that then you can probably pretty easily suggest (or even automatically generate) indices for your users based on query patterns.


If most of the Foo table has Bar=42, an index on Bar works, or you may have to rewrite the query as (Bar < 42 or Bar > 42) on some DBs. If Bar=42 is just a few rows then yeah it's best to do a table scan.

For the second query, an index on (y, x) is very good most of the time. You would use the natural sort order of the index, and prune x <= 42 rows from the result set by only using the index. If almost all of the table is x <= 42 it's better to index on x, have it sort everything and throw away beyond the limit. Although if you are specifying a limit just for pagination, then it's decent to get e.g. the Nth element from the y sort order, and then only query where (y > ? and y <= ? and x > 42), possibly with a different limit and some other clauses if y is not unique.

I feel like these types of queries are basically non-existent in OLTP though. If you are going to block expensive queries, knowing the queries ahead of time makes it a lot easier.
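
In index-DDL terms, the two options described above for the second query look roughly like this (using the example table; which one wins depends on the selectivity of x > 42):

  -- walk the index in y order and filter x from the index entries:
  CREATE INDEX baz_y_x ON Baz (y DESC, x);
  -- or: range-scan on x, then sort the survivors and apply the LIMIT:
  CREATE INDEX baz_x ON Baz (x);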


> I feel like these types of queries are basically non-existent in OLTP though.

The second query feels pretty normal in any system that has filters and sorting on a collection.


SELECT * FROM Foo WHERE Bar != 42; if Bar was an INT and had a high density (1/distinct values) then a columnstore index on Bar might perform faster than a per-row index. Worth trying out as it isn't certain and would depend on row cardinality.

SELECT * FROM Baz WHERE x > 42 ORDER BY y DESC LIMIT 10; a per-row index on x should in theory seek all values above 42 rather than perform an index/table scan, and the ordering and row limitation would be processed last i.e. in the SELECT. Again, not certain though.

SQL Server already has the missing_index_stats DMVs which can suggest indexes based on previous query use, and generating frequency diagrams / histograms based on previous queries run is relatively straightforward and would contribute to any index suggestion engine.
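
A sketch of what pulling suggestions out of those missing-index DMVs looks like (column names from memory; double-check against the documentation):

  SELECT d.statement, d.equality_columns, d.inequality_columns, d.included_columns,
         s.user_seeks, s.avg_user_impact
  FROM   sys.dm_db_missing_index_details d
  JOIN   sys.dm_db_missing_index_groups g      ON g.index_handle = d.index_handle
  JOIN   sys.dm_db_missing_index_group_stats s ON s.group_handle = g.index_group_handle
  ORDER BY s.user_seeks * s.avg_user_impact DESC;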


I don't think either of these would be difficult:

For the first, if the 42 value is fixed, you can actually make that query quite fast. In MySQL you could create a functional index for the boolean condition and quickly find all the rows that don't meet that criteria. Basically all the rows are sorted at insert and your query would take ms (constant time).

For the second, even if the 42 is not fixed, you could create a composite index for X, and Y DESC and get very acceptable results (log time).
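
If I remember the MySQL 8.0 syntax right, those two indexes look something like this (example table names; whether the optimizer actually matches the functional index depends on how the predicate is written):

  -- functional index over the fixed condition (first query):
  CREATE INDEX foo_bar_not_42 ON Foo ((Bar <> 42));
  -- composite index for the second query:
  CREATE INDEX baz_x_y ON Baz (x, y DESC);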


The 42 was meant to be a placeholder in the examples. Imagine it is a ? and bound at runtime instead.

In the second query, a composite index over (x, y) basically does the same job as an index over just x. The general query plan is to iterate over all values > x and keep the 10 smallest values around in memory to satisfy the ORDER BY.

The point is that there are many queries that seem deceptively simple but are actually extremely hard or even impossible to automatically index. Those queries which result in table scans, large index scans, and large sorts can easily dominate all the other queries which are easily indexable in terms of resource usage.


Gotcha. Yeah, you'd definitely be constrained to log time, though with proper bucket sizing it'd be manageable if the values were more bounded. For large datasets, there's just not a real good way to make this fast if you need an _exact_ answer. If you can deal with a probabilistic answer there's faster methods.


So basically I realize there are ways to generate index suggestions (or even generate indices automatically) by observing query patterns, keeping stats on cardinality of various columns, etc. We actually did that at Parse; we would observe query patterns and generate indices (including compound indices) at runtime that seemed to benefit the application's query pattern. There wasn't an option for this, it was just enabled by default.

However we ran into the problem that users still made many queries which were essentially impossible to index. On many of the DB nodes, the resources used by these unindexable queries dominated the resources used by all of the easily indexed queries.

I left the experience thinking that there really wasn't any good solution to the problem for general users other than making the query language force users to only make queries that can be obviously satisfied by common index types.


I think a good DBA ought to be able to do better than an automatic process possibly could. The database can't know whether an index ought to be unique; you have to understand the data model to know that. You might want to make an index multi-column even if you aren't using all the columns the first time you make a query, but again you need a human-level understanding of the data model to do that. And if you already have indexes that work reasonably well, you can't expect an automatic optimizer to realize it could do even better by replacing existing indices. That being said, an automatic indexer could probably be "good enough" in many cases.


> I think a good DBA ought to be able to do better than an automatic process possibly could

Absolutely, although in modern orgs I'd expect this to be the domain of backend engineers rather than "a DBA", a mythical creature I've never actually seen in real life.


Oh they exist alright, I have met them. They are bearded old men that rule the kingdom of Oracle in a dusty room that smells of stale coffee. If you ask them if you may please have a table they respond angrily "NO, you may NOT have a table!"


I was an Oracle DBA for 11 years.

The joke is:

Q: What is the difference between a DBA and a terrorist? A: You can negotiate with a terrorist.

After being a DBA, I moved into security, so I could tell people NO with even less consideration and more cruelty.


I sometimes think this was the actual use case being solved by the nosql movement.


all new technology solves a problem, sometimes it's a social problem


And you would be right.

Grumpy old men complaining no one in the company is able to write good enough queries.

We got rid of them when we migrated from Oracle to AWS Aurora.


This made me spit out my stale coffee.


Where I'm from, DBA is just a euphemism for "employee with write access to the prod DB"


Time and again AI systems outperform humans. This may be true today but not tomorrow.


"Peloton is a relational database management system that is designed for autonomous operation. The system’s integrated planning component not only optimizes the system for the current workload, but also predicts future workload trends before they occur so that the system can prepare itself accordingly. It also enables new optimizations that are not possible today because the complexity of managing these systems has surpassed the abilities of human experts."

https://pelotondb.io/


It depends also on what you want to prioritize. Indices add insertion overheads. You can have general rules for indices but ultimately it comes down to what operations are time-critical.


Indices also can be extremely large. For example, a

  SELECT foo, bar, baz FROM quux WHERE foo = ?;
will be fastest with an index on foo that also stores bar and baz (once the index entry is found, the engine won’t need to do a disk seek to the row data). If that table is large and you run

  SELECT foo FROM quux WHERE foo = ?;
(Edited after zimpenfish’s remark about a copy-paste error)

more frequently than the former query, just creating an index on foo alone may be the better choice. Whether it is depends on lots and lots of factors, such as query distribution (is it worth it to make the monthly report twice as fast at the cost of running many other queries 1% slower?), disk speed (if you move a database to SSD you should revisit all indexing choices), and disk space.

With advanced SQL engines, there's way more choice. Maybe it's best to create an index on the first n characters of a string field, or an index that only works for equality searches, not for range predicates such as foo < 3, or an index on only the data from last month (can be done in some databases with partitioned tables).
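
Sketches of the three kinds of indexes mentioned, with the caveat that the syntax is engine-specific (INCLUDE is Postgres 11+/SQL Server, the prefix form is MySQL, the partial WHERE form is Postgres/SQLite) and that the string/date columns are hypothetical:

  -- covering index: foo is the key, bar and baz ride along in the leaf entries
  CREATE INDEX quux_foo_cov ON quux (foo) INCLUDE (bar, baz);
  -- prefix index on the first 10 characters of a string column (MySQL):
  CREATE INDEX quux_name_pfx ON quux (name(10));
  -- partial index on only last month's rows:
  CREATE INDEX quux_recent ON quux (foo) WHERE created_at >= DATE '2022-06-01';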


> SELECT foo, bar, baz FROM quux WHERE foo = ?;

> SELECT foo, bar, baz FROM quux WHERE foo = ?;

Possibly a typo in the second query since the text indicates it should be different from the first?


If a database is expected to be good enough to optimize query plans by default, wouldn't you expect it to be also good enough to optimize layout/indexing?


The problem is that indexes aren't free; an index can easily take up gigabytes of disk space, if not more. There are also performance considerations with respect to insert performance.

What a "good" index is depends on the use case; for example for a webapp a 0.25 second query that's run on every pageload is extremely slow, and creating an index for that is almost certainly a good trade-off. On the hand, it's usually fine if some analytical query that only your finance people need once a month runs for a minute; in that case, an index is probably a poor trade-off. Or: in my app one query takes about ~5 seconds for a rarely used feature; I could add an index to make it faster, but it would also eat up ~25G of disk space (last I checked, probably more now). The very small number of people using it can wait an extra few seconds.

It's not too hard to automatically create indexes; I'd say it's almost easy. But these kinds of judgements based on use case aren't something the database can do, and you're likely to end up with dozens of semi-useless indexes eating up your disk space and insert performance.

This is probably one of those cases where automating things will lead to problems and confusion; it's probably better to spend time on tools to gain insight in what the database is doing, so a human can come along and do something that makes sense.


> But these kinds of judgements based on use case aren't something the database can do, and you're likely to end up with dozens of semi-useless indexes eating up your disk space and insert performance

That is an empirical question and another comment points to this working just fine: https://news.ycombinator.com/item?id=31991469

Even ignoring that, creating indexes is a matter of tradeoffs - it's not the index creation that's difficult but the decision on whether the tradeoff is worth it. Seeing this, possible issues arising from automatic index creation can be mitigated by allowing the admin to set parameters that dictate where the tradeoff should be made (as a naive example: "I'm okay with using 10GB of storage if it results in a 20% improvement for P95 queries on this table").

This is not an unreasonable idea and it sounds like it would greatly improve UX for devs working on median-sized DBs (people needing FAANG scale can manually tweak their DBs). Worst case, you can have a whitelist approach to automatic indexes, where the admin is shown index suggestions which require manual approval.


>for example, for a webapp a 0.25 second query that's run on every pageload is extremely slow, and creating an index for that is almost certainly a good trade-off. On the other hand, it's usually fine if some analytical query that only your finance people need once a month runs for a minute; in that case, an index is probably a poor trade-off.

But that should be automatable. If the DB sees frequent queries, it would create indexes, if those queries cease, it would drop them. If queries are infrequent, it would not create an index.


Would not want to be responsible for a database where some AI was vigorously creating and dropping indexes behind the scenes, just as I was trying to debug performance issues.


Traditionally, query optimizers are focused on producing good, not disastrous, plans. Not necessarily the best plan. This is the role of a query optimizer: do decently well and don't royally screw up. What makes optimizers screw up is cardinality misestimation, which leads to the wrong join order or the wrong join type (a nested loop join thinking that a subresult in a query plan is 3 rows when in reality it's 1M).

Optimizing physical layout presents new challenges:

- it takes time to build those layouts and consumes resources. nothing like adding a load to a database that's already struggling to keep up with the load :)

- you don't know what you are going to break

With all that said, that's the future, and the industry has to figure it out.


They're often terrible at optimizing query plans, especially when there's no index, since the index provides data about the data (cardinality, for example) to the query optimizer.


What to optimize then also becomes an optimization problem.

Devs struggle to plan for read/write workloads. Adding more indexes can then make the same inbound volume as the previous day require n more writes per row.

You can still get surprised by the optimizer changing query strategy on relatively simple joins when it hits tuning parameters.

Making analyzers more visual and giving lots of suggestions (that could then be ignored) would be a good intermediate step.


> You can still get surprised by the optimizer changing query strategy on relatively simple joins when it hits tuning parameters.

I imagine there is a large bias about caring or not about this depending on what DBMS you are used to.

It just happens that on some DBMS you get "surprised" as "oh, shit, this problem again", and on others it usually comes as a surprise as "wait, I never thought about optimizer limits". The second set is much more prone to automating.


The one time I spent with an SAP sales guy in an elevator, he tried to convince me that SAP needs no manual indices, because their database is auto-indexing. Very convincing, because that might be the reason why SAP queries are so "fast". Like hours for a simple unindexed date range search.


@dang Mark this from 2006.

Most modern databases do auto optimization, stats, and index tuning.


And techniques like learned-indices [0], optimizers like ottertune [1] / dynimize [2], and alt-views like materialize.io and readyset.io exist.

[0] https://news.ycombinator.com/item?id=25899286

[1] https://news.ycombinator.com/item?id=28868382

[2] https://news.ycombinator.com/item?id=17025627


Friendly heads up, @dang, @mods, and everything in that category doesn't do anything.


Correct. Best option is to email mods at hn@ycombinator.com

Though someone might also see the mention in a thread. There's no specific notification value in tagging mods or admins however.


Check out this project from CMU, one of the goals is to use ML-driven auto-indexing. https://noise.page/about/


this is from the group run by Prof Andy Pavlo who is the founder of ottertune (referenced in other comments)


> Why Can't Database Tables Index Themselves?

I think the company Snowflake Computing was founded on this premise?


The correct answer here is that any single query lacks the context of the entire suite of queries that will be performed against the database and their relative importance and sensitivity to performance.

Indexes compete with each other and with table data for space in memory cache(s). They also require significant CPU and IO resources at write time to be kept up to date. Having too many indexes can lead to them not being used (best case) or to them evicting each other, or more important data, over and over from caches. In many cases it is better not to have one index that helps some specific query perform 50% better if it means another index, one that makes some other query run in 1/500th the time, is able to stay in memory. It's often better to just scan data and have queries take an hour if they are for nightly reports, rather than create the enormous multi-column indexes required to speed them up, since you don't care about the reporting queries taking an hour overnight (at least in this hypothetical).


It's possible for sure.

The problem is that in such a scenario, there is no requirement or even recommendation for query writers to structure queries with regard for existing indices. You will rapidly end up with too many partially-overlapping indices, along with the overhead to maintain them.


Almost every time, this is still a large gain. And in both cases, if you have complete coordination of the developers, you get the same result.

You will probably want a "what query uses this index?" feature, that is the other side of the coin of "what indexes make my queries faster?"


I think you can afford that perspective until index maintenance can no longer keep up with the rate at which your system generates rows, or your index size begins to eclipse memory capacity.

Having been there, done that, I can assure you more indexes are not always your friend.


Well, if your system is not measuring the cost of the indexes, and deciding on removing them, then you are correct that it's useless.


The best comment is on the page itself, "DBAs don’t like the DB server doing things that they didn’t expect."

Imagine how much less predictable your database becomes when it's trying to be "smart". I wouldn't want to be oncall for that thing.


Makes me wonder if you could use a SQL Injection exploit to maliciously trick out the index generation feature, and create a totally shit index and cause a DOS attack.


Could probably tip the weighting to create unnecessary indexes in some cases. Really depends on how automatic the index creation is; it could be rife with potential exploits.


If you have already found SQL injection, then why not do it yourself? Or a lot of other bad things?


I don't want DBAs optimizing my indexes. With elastic horizontal scaling, I'm not sure I want DBAs at all.


A somewhat related question: Why can't you add an index to a simple in-memory list? It would be great if you could have a std::vector or Python list of plain old objects, and then tell it to add an index (using a hashtable for example) on one or two of the fields.

I played with something like that in Python a couple of years ago, but never made it production-ready. I think in many simple cases it could replace a dedicated (SQL) database.


Because, at least in C++ and Python that you mention, you can mutate (a field of) an element of the list without the list noticing, and hence without the index being updated. In the general case, you’d not only have to wrap the list, but also its elements, and wire the element wrappers up with the list. So it’s better to have the application do that because it knows when the index needs updating for a given element (and its old & new value).

The other reason is that most often you only need an index on one field, and don’t need the elements to be in a particular order, so you just use a map/dict with the value of that field, instead of a list.


> Why can't you add an index to a simple in-memory list?

The index is already a "simple in-memory list", sorted. That's why it is expensive to maintain.

> It would be great if you could have a std::vector or Python list of plain old objects

Sort said Python list by the columns in your index.

> and then tell it to add an index (using a hashtable for example) on one or two of the fields.

A hashtable is another thing entirely, designed for fast key retrieval at very high memory and CPU cost. If you can't afford an index you can't afford a hashtable/dict either.


Every time I've found myself heading down this rabbit hole of creating a faster Python list sort/search, I've inevitably realized that using an in-memory sqlite database is orders of magnitude faster and more flexible, especially once you start trying to mangle lists or sets with millions of elements.

The added benefit is that you can also dump it to the drive if you have to prematurely end your task and resume it from the same place later. And it also supports indexes, both traditional and full-text searching.

It's a lot easier to use a tool designed for the task instead of building one from scratch.
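
A minimal sketch of that pattern in plain SQLite SQL (run against a ':memory:' connection; the table and names are made up, and VACUUM INTO needs SQLite 3.27+):

  CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL);
  CREATE INDEX items_name ON items (name);       -- the "index on a field" the parent asks for
  SELECT * FROM items WHERE name = 'widget';     -- indexed lookup instead of a full list scan
  VACUUM INTO 'snapshot.db';                     -- dump the in-memory DB to disk to resume later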


> It would be great if you could have a std::vector or Python list of plain old objects, and then tell it to add an index (using a hashtable for example) on one or two of the fields.

This is what a LinkedHashMap (1 index) and HashBasedTable (2 indices) do in the jvm-land. Admittedly, there isn't a convenience api to convert a given collection/vector to a map/table (but hey, if you're like me and write a lot of golang, then the current jvm-land status quo is already heaven!).

https://guava.dev/releases/snapshot-jre/api/docs/com/google/...


Honest question: how many applications are using in-memory lists that are large enough to warrant indexing but are also too small to put in a database?

That's probably why such a library doesn't exist. Sorting & search is good enough for one field. Full scans aren't that bad for small/medium datasets. And databases aren't that difficult to integrate into an application.

Also, pandas exists in the Python world.


Usually laziness, in my experience. That internal "fine, I guess I'll open the SQLAlchemy docs and spin up a database instance for this instead of monkeying around with Python sets and lists" argument developers have with themselves.

Pandas is also very powerful, but every time I go to use it I feel I have to learn the API all over again.


This talk explored that idea a bit: you could have a (full) programming language which is more like SQL, so you get good enough data structures (and indexing) without manual tuning: https://www.youtube.com/watch?v=gC295d3V9gE

I think Prolog (another commenter pointed it out) and Datalog do this to some extent.


This is an awesome idea! I'm sure there is an implementation somewhere.



Boost has something called MultiIndex.


The obvious answer is that an index is a tradeoff: slower writes and more disk space for faster reads. Unless the DB can know that disk space and write perf are worth the tradeoff, it would be difficult for it to make a good decision.

If you didn't care about disk space and the db could do async indexes but not have stale reads then an automatic index system could make sense.


Disk space is becoming less of an issue in cloud environments with the rise of multi-tenant storage: Aurora, Neon (I'm CEO of Neon). The cost of index maintenance is real and should be considered.

Due to the multivariable nature of this problem, ML is the right approach to solve it, and the ability to "prove" that it will work is also key.


Constraint programming might be an alternative if we want "explainability".


I thought Oracle already has this feature - https://www.oracle.com/news/connect/oracle-database-automati...


It does, though there are some restrictions regarding which versions it can run on, unless you want to enable features that are not officially supported: https://oracle-base.com/articles/19c/automatic-indexing-19c

My issue is that you cannot (easily) delete the automatically generated indices, should you want to do that for whatever reason. Of course, Oracle will try to only enable the indices that will lead to actual performance gains, but when it comes to removing the automatically created ones altogether, the best that you can do is:

  DBMS_AUTO_INDEX.CONFIGURE ('AUTO_INDEX_RETENTION_FOR_AUTO', '1');
Consider that you have a few different environments, all with the same database schema with a few hundred tables in it. You might want to take one of those environments (a development environment) and enable the feature there, have a bunch of indices be automatically created, take the DDL for those and turn them into migrations that can be run against other environments (any number of testing environments, eventually production) either manually or through some automated SQL migration solution.

The problem is that you can't easily get rid of all the automatically generated ones so that you could test this migration against the same environment (basically create those same indices through your own SQL).
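
For reference, the related knobs look roughly like this (parameter and function names from memory; check the DBMS_AUTO_INDEX docs before trusting them):

  -- turn the feature on (or 'REPORT ONLY' to get suggestions without creating anything):
  EXEC DBMS_AUTO_INDEX.CONFIGURE('AUTO_INDEX_MODE', 'IMPLEMENT');
  -- see what it has been doing, including the DDL of the indexes it created:
  SELECT DBMS_AUTO_INDEX.REPORT_LAST_ACTIVITY() FROM dual;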


Here's a database that does much better than this, it creates and fine-grainedly updates materialized views of the queries you're using: https://github.com/mit-pdos/noria

Here's an excellent interview with the creator: https://corecursive.com/030-rethinking-databases-with-jon-gj...


This, this, this, a thousand times this. I've had building a system such as this for Postgres on my bucket list for almost a decade now. Why is this not a thing?


There's a lot of good ecosystem stuff around this, though I personally can't use most of it as the HypoPG extension isn't on AWS RDS. :(

https://github.com/HypoPG/hypopg

HypoPG is a PostgreSQL extension adding support for hypothetical indexes.

A hypothetical -- or virtual -- index is an index that doesn't really exist, and thus doesn't cost CPU, disk or any other resource to create. They're useful for finding out whether specific indexes can increase performance for problematic queries, since you can know whether PostgreSQL will use these indexes or not without having to spend resources to create them.

With one approach to actually implementing it here: https://www.percona.com/blog/2019/07/22/automatic-index-reco...

https://github.com/ankane/dexter is an autoindexer project that builds on hypopg, and I believe it can be configured to autocreate indices — it will only suggest them by default.
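
The basic hypopg workflow is pleasantly small (hypothetical table/column names; a plain EXPLAIN, without ANALYZE, is what can pick up the hypothetical index):

  CREATE EXTENSION hypopg;
  SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');
  EXPLAIN SELECT * FROM orders WHERE customer_id = 42;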


Doing this well for a production system is a hard problem. The sibling comment pointed out HypoPG and Dexter, but I'm not familiar with folks using a simple approach like the one Dexter implements on a production system. The human element is valuable when indexing, to correctly model the workload understanding and assess the trade-offs (index write overhead, etc).

For context, I've personally been working on an automatic indexing system for Postgres for more than a year now [1], and whilst I think what we have today is pretty useful, there is still work to be done. We've intentionally not yet enabled full automation (i.e. actual automatic creation of the indexes on the production database), because most people I've talked with prefer to review a recommendation and then apply it through their regular migration tooling, after making an assessment.

If you ever want to talk more about this, feel free to send me an email - I've spent a lot of time thinking about this topic :)

[1] https://pganalyze.com/blog/automatic-indexing-system-postgre...


One of the reasons is "safety". It's hard to make sure that adding an index won't have adverse effects on your app. Adding an index invalidates query plans and new plans may or may not be as good as the old ones.

I think the enabling technology for this is branching and the ability to "fork" a database AND the workload to prove that adding an index doesn't break things.


So we’re back to Kubernetes! ;)


Except that Kubernetes (at least as of the last time I checked) more or less ignores exactly this hard part - keeping state sane and manageable - because it's hard and implementation/vendor dependent.

The only reasonably ready "Kubernetes" project I saw attempting to solve this was SUSE/Rancher longhorn, and that was still somewhat incubating last time I checked.


DBMS don't do this by default because indexes aren't free. Each additional index means the database has to write to an additional location for (potentially) every insert/update.

The feature would need to also keep track of which indexes are unused and remove them over time to avoid the massive write amplification that would come from building a new index for every query that crossed the threshold over time.


I had a different thought when thinking about this back in the day: compilers rolled into the application, learning from usage which parts are important and recompiling when that starts to seem useful. Code that is "never" used can be left on disk until it is needed. The clients could upload their strategies to a central server to configure default configs for novice, average, and advanced users.




If I remember correctly Google AppEngine's NDB had such a thing maybe 10 years ago. Every time you queried the database it would create an index. I don't know how smart that mechanism was.

I don't know how things are in Google Appengine land now.


Indexes typically only speed up reads while also slowing down writes (sometimes massively).

SARGability of the queries is also needed to effectively use the indexes.


Article title could use a (2006) @mods @dang


B+ tree index updates are not free, and get slower and slower as the rowcount increases.


Firestore not only recommends, but requires you to have the index or the query will fail.


Stating the obvious: if a field is called "index" (or similar) and doesn't have an index on it, maybe the database could default to making it one.


"Obvious" for you, maybe for a general AI, but not to the DBMS.

What if such a field is an uncompressible 12MB blob of an image of an index card?

What if indexing them would waste several TBs of RAM? That is not a good default.

In a more realistic scenario, it is reasonably rare that you will only ever search using the PK and no other constraint, so in a way it is a wasted default index.


Good point, make that only apply to fixed length chars (not varchars). I think JOINs use indexes which are often done on the primary key. https://www.mssqltips.com/sqlservertutorial/3207/make-sure-a...


It is called a clustered index, used by MySQL. You would still need differently ordered indexes.


RavenDB has auto indexes


What are SQL statistics?


The "modern" DB world reminds me of dentistry. Zero reason a database can't be plug and play right now.

Dentistry wants to remain in the dark ages, and so does the database world. The rest of us just want something like the electrical (non-Texas) grid. Just works. When it doesn't, it comes back up, and all is fine.


You can get a plug-and-play database experience at any of the dozens of companies that will host a database for you. You'll have to define your own tables and indexes of course, but the grid doesn't take care of the wires inside your house either.

The electrical grid is kept up by the continuous effort of tens of thousands of employees across the country btw, it's hardly magic. I bet most people would be horrified if they realized how much manual planning is involved in keeping the electrical grid running.


What would dentistry that isn't in the dark ages do differently? Genuinely curious.



