I wrote some PHP that lets my users find their queries and run the analyzer; it suggests new indexes, among other performance-enhancing options.
There is also this SQLite index suggestion tool on the front page today...
Modern Prolog systems also perform multi-argument and deep indexing. In older Prolog systems, you sometimes had to declare explicitly on which arguments the system should index, analogous to those database systems where you have to declare indices manually.
"The adoption of a relational model of data, as described above, permits the development of a universal data sub-language based on an applied predicate calculus. A first-order predicate calculus suffices if the collection of relationships is in normal form."
One great advantage of using Prolog is that queries themselves are so readily amenable to analysis and reasoning with the same mechanism, and it becomes very easy to optimize them by reordering goals.
It's a lot like... PHP 4 in comparison with Standard ML.
And, sadly, both SML and Prolog are fading away, while SQL (and PHP) are here to stay.
Besides, SML always was the cleanest and most consistent of the bunch. Which kind of supports my point: it's not always the best language that wins.
Prolog allows effects on its database while executing the "query" - you can add or remove facts from the database. SQL prohibits that, separating queries from data manipulation statements.
And, of course, there are things like list comprehensions in Haskell and other languages, which are more or less the same as SQL but definitely more succinct.
With the exception of the case where the set of rows you want to return (a projection of) is exactly the set being modified.
IIRC, you can hack around this with a JOIN in the FROM clause of the main statement, even if it is only of substantive use in the RETURNING clause, to get what amounts to arbitrary data-modifying queries.
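A rough sketch of what that looks like in PostgreSQL syntax (table and column names are made up):

  UPDATE orders o
  SET status = 'shipped'
  FROM customers c
  WHERE c.id = o.customer_id AND c.region = 'EU'
  RETURNING o.id, c.name, o.status;

Here the joined table feeds both the filter and the RETURNING output, so the statement as a whole both modifies and returns data.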
I may have missed something, though, are you thinking of something other than asserting and retracting?
Can we syntactically write a SELECT query that also does an UPDATE/INSERT/DELETE? It may or may not be possible, and to find out which, one has to have a good grasp of a quite complex syntax specification.
For comparison, in Prolog, queries have a very uniform and simple syntax, and a small set of predicates is used to modify the database. You can find out whether a query performs any of these operations by meta-interpreting it. It is easy to write a meta-interpreter for Prolog queries and to allow only calls to predicates that you know do not affect the database. In comparison, it is extremely hard to analyze and interpret SQL queries, not least because the syntax of SQL is so complex that we cannot easily answer even basic questions about it.
Last time I checked, rolling back changes made to the fact database was not even a notion in the Prolog language. There was no notion of a transaction in Prolog.
"meta-interpretation" of SQL statement amounts to looking at the statement type, whether it is an INSERT/DELETE/UPDATE or just a SELECT query.
There is not much complexity in the SQL syntax. My first experience writing a more or less non-toy parser was parsing an SQL schema, 20 years ago. SQL is not harder than VHDL with its context-sensitive lexing phase, where you can't answer whether '(' is a character literal or part of a record type value construction. Maybe DELIMITER can bite you, but, again, it is not hard to parse with parser combinators.
The difficulty in figuring out all this syntactic weirdness is probably overstated for small queries, but it's there. Most attempts to write a better SQL I've seen succeed in narrow terms, but if the goal is to replace SQL rather than write a better syntax which (almost) no one uses, that's not enough.
No, but you can write an UPDATE/INSERT/DELETE that returns a result set like a SELECT.
Under SQL Server, you can attach OUTPUT clause to INSERT/UPDATE/DELETE/MERGE and effectively have a query that both modifies and returns data. Other DBMSes have similar capabilities.
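A minimal T-SQL sketch (the sessions table is hypothetical):

  DELETE FROM sessions
  OUTPUT DELETED.session_id, DELETED.user_id
  WHERE expires_at < GETDATE();

The DELETE removes the rows, and the OUTPUT clause hands them back as a result set, just like a SELECT would.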
Meanwhile Prolog does not even have a notion of transactions.
SELECT * FROM Foo WHERE Bar != 42;
SELECT * FROM Baz WHERE x > 42 ORDER BY y DESC LIMIT 10
For the second query, an index on (y, x) is very good most of the time. You would use the natural sort order of the index, and prune x <= 42 rows from the result set by only using the index. If almost all of the table is x <= 42 it's better to index on x, have it sort everything and throw away beyond the limit. Although if you are specifying a limit just for pagination, then it's decent to get e.g. the Nth element from the y sort order, and then only query where (y > ? and y <= ? and x > 42), possibly with a different limit and some other clauses if y is not unique.
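A sketch of that last variant in PostgreSQL-style syntax (index name made up; the placeholder stands for the last y value seen on the previous page):

  CREATE INDEX baz_y_x ON Baz (y DESC, x);

  SELECT * FROM Baz
  WHERE x > 42 AND y < ?
  ORDER BY y DESC
  LIMIT 10;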
I feel like these types of queries are basically non-existent in OLTP though. If you are going to block expensive queries, knowing the queries ahead of time makes it a lot easier.
The second query feels pretty normal in any system that has filters and sorting on a collection.
SELECT * FROM Baz WHERE x > 42 ORDER BY y DESC LIMIT 10; a per-row index on x should in theory seek to all values above 42 rather than perform an index/table scan, and the ordering and row limitation would be processed last, i.e. in the SELECT. Again, not certain though.
SQL Server already has the missing-index DMVs (sys.dm_db_missing_index_*), which can suggest indexes based on previous query use, and generating frequency diagrams / histograms based on previously run queries is relatively straightforward and would contribute to any index suggestion engine.
For the first, if the 42 value is fixed, you can actually make that query quite fast. In MySQL you could create a functional index for the boolean condition and quickly find all the rows that don't meet that criterion. Basically all the rows are sorted at insert time and your query would take ms (constant time).
For the second, even if the 42 is not fixed, you could create a composite index on x and y DESC and get very acceptable results (log time).
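Rough sketches of both ideas in MySQL 8 syntax (index names made up; for the functional index, the WHERE clause has to use the same expression for the optimizer to match it):

  CREATE INDEX foo_bar_not_42 ON Foo ((Bar <> 42));  -- functional index, MySQL 8.0.13+
  CREATE INDEX baz_x_y ON Baz (x, y DESC);           -- composite index for the range + sort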
In the second query, a composite index over (x, y) basically does the same job as an index over just x. The general query plan is to iterate over all values > x and keep the 10 smallest values around in memory to satisfy the ORDER BY.
The point is that there are many queries that seem deceptively simple but are actually extremely hard or even impossible to automatically index. Those queries which result in table scans, large index scans, and large sorts can easily dominate all the other queries which are easily indexable in terms of resource usage.
However we ran into the problem that users still made many queries which were essentially impossible to index. On many of the DB nodes, the resources used by these unindexable queries dominated the resources used by all of the easily indexed queries.
I left the experience thinking that there really wasn't any good solution to the problem for general users other than making the query language force users to only make queries that can be obviously satisfied by common index types.
Absolutely, although in modern orgs I'd expect this to be the domain of backend engineers rather than "a DBA", a mythical creature I've never actually seen in real life.
The joke is:
Q: What is the difference between a DBA and a terrorist?
A: You can negotiate with a terrorist.
After being a DBA, I moved into security, so I could tell people NO with even less consideration and more cruelty.
Grumpy old men complaining that no one in the company is able to write good enough queries.
We got rid of them when we migrated from Oracle to AWS Aurora.
SELECT foo, bar, baz FROM quux WHERE foo = ?;
SELECT foo FROM quux WHERE foo = ?;
If you run the latter query more frequently than the former, just creating an index on foo alone may be the better choice. Whether it is depends on lots and lots of factors, such as query distribution (is it worth making the monthly report twice as fast at the cost of running many other queries 1% slower?), disk speed (if you move a database to SSD you should revisit all indexing choices), and disk space.
With advanced SQL engines, there's way more choice. Maybe it's best to create an index on the first n characters of a string field, or an index that only works for equality searches, not for predicates such as foo < 3, or an index on only the data from last month (which can be done in some databases with partitioned tables).
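Rough sketches of some of those options (mixed dialects, all names made up; the partial index stands in for the "only last month" idea):

  CREATE INDEX quux_foo ON quux (foo) INCLUDE (bar, baz);  -- PostgreSQL 11+: covering index for the first query
  CREATE INDEX t_name_prefix ON t (name(10));              -- MySQL: index only the first 10 characters
  CREATE INDEX t_foo_hash ON t USING hash (foo);           -- PostgreSQL: equality-only hash index
  CREATE INDEX t_recent ON t (foo) WHERE created_at >= DATE '2024-06-01';  -- PostgreSQL partial index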
> SELECT foo, bar, baz FROM quux WHERE foo = ?;
Possibly a typo in the second query since the text indicates it should be different from the first?
What a "good" index is depends on the use case; for example for a webapp a 0.25 second query that's run on every pageload is extremely slow, and creating an index for that is almost certainly a good trade-off. On the hand, it's usually fine if some analytical query that only your finance people need once a month runs for a minute; in that case, an index is probably a poor trade-off. Or: in my app one query takes about ~5 seconds for a rarely used feature; I could add an index to make it faster, but it would also eat up ~25G of disk space (last I checked, probably more now). The very small number of people using it can wait an extra few seconds.
It's not too hard to automatically create indexes; I'd say it's almost easy. But this kind of judgement based on use case isn't something the database can make, and you're likely to end up with dozens of semi-useless indexes eating up your disk space and insert performance.
This is probably one of those cases where automating things will lead to problems and confusion; it's probably better to spend time on tools to gain insight in what the database is doing, so a human can come along and do something that makes sense.
That is an empirical question and another comment points to this working just fine: https://news.ycombinator.com/item?id=31991469
Even ignoring that, creating indexes is a matter of tradeoffs - it's not the index creation that's difficult but the decision on whether the tradeoff is worth it. Given this, possible issues arising from automatic index creation can be mitigated by allowing the admin to set parameters that dictate where the tradeoff should be made (as a naive example: "I'm okay with using 10 GB of storage if it results in a 20% improvement for P95 queries on this table").
This is not an unreasonable idea and it sounds like it would greatly improve UX for devs working on median-sized DBs (people needing FAANG scale can manually tweak their DBs). Worst case, you can have a whitelist approach to automatic indexes, where the admin is shown index suggestions which require manual approval.
But that should be automatable. If the DB sees frequent queries, it would create indexes; if those queries cease, it would drop them. If queries are infrequent, it would not create an index.
Optimizing physical layout presents new challenges:
- it takes time to build those layouts, and doing so consumes resources; nothing like adding load to a database that's already struggling to keep up :)
- you don't know what you are going to break
With all that said, that's the future, and the industry has to figure it out.
Devs struggle to plan for read/write workloads. Making more indexes can then make the same inbound volume from the previous day require n more writes per row.
You can still get surprised by query optimizations on relatively simple joins just changing query strategy when hitting tuning parameters.
Making analyzers more visual and giving lots of suggestions (that could then be ignored) would be a good intermediate step.
I imagine there is a large bias in whether people care about this, depending on which DBMS they are used to.
It just happens that on some DBMSs the surprise is "oh, shit, this problem again", and on others it comes as "wait, I never thought about optimizer limits". The second kind is much more amenable to automation.
Most modern databases do auto optimization, stats, and index tuning.
Though someone might also see the mention in a thread. There's no specific notification value in tagging mods or admins, however.
I think the company Snowflake Computing was founded on this premise?
Indexes compete with each other and with table data for space in memory cache(s). They also require significant cpu and IO resources at write time to be kept up to date. Having too many indexes can lead to having them not being used (best case) or having them evict each other or more important data over and over from caches. In many cases it is better to not have one index that helps some specific query perform 50% better if it means another index is able to stay in memory that makes some other run in 1/500th as much time. It's often better to just scan data and have queries take an hour if they are for nightly reports rather than create the enormous multi-column indexes required to speed them up, since you don't care about the reporting queries taking an hour over night (at least in this hypothetical).
The problem is that in such a scenario, there is no requirement or even recommendation for query writers to structure queries with regard for existing indices. You will rapidly end up with too many partially overlapping indexes, along with the overhead to maintain them.
You will probably want a "which queries use this index?" feature - the other side of the coin of "which indexes make my queries faster?"
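In PostgreSQL you can get partway there with the statistics views; e.g. a sketch for spotting indexes that are never scanned:

  SELECT schemaname, relname, indexrelname, idx_scan
  FROM pg_stat_user_indexes
  ORDER BY idx_scan ASC;

That answers the "is this index used at all" half; mapping indexes back to specific queries takes more work (e.g. sampling pg_stat_statements and EXPLAIN plans).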
Having been there, done that, I can assure you more indexes are not always your friend.
Imagine how much less predictable your database becomes when it's trying to be "smart". I wouldn't want to be oncall for that thing.
then why not do it yourself?
or a lot of other bad things?
I played with something like that in Python a couple of years ago, but never made it production-ready. I think in many simple cases it could replace a dedicated (SQL) database.
The other reason is that most often you only need an index on one field, and don't need the elements to be in a particular order, so you just use a map/dict keyed by that field's value, instead of a list.
The index is already a "simple in-memory list", sorted. That's why it is expensive to maintain.
> It would be great if you could have a std::vector or Python list of plain old objects
Sort said Python list by the columns in your index.
> and then tell it to add an index (using a hashtable for example) on one or two of the fields.
A hashtable is another thing entirely, designed for fast key retrieval at very high memory and CPU cost. If you can't afford an index you can't afford a hashtable/dict either.
The added benefit is that you can also dump it to the drive if you have to prematurely end your task and resume it from the same place later. And it also supports indexes, both traditional and full-text searching.
It's a lot easier to use a tool designed for the task instead of building one from scratch.
This is what a LinkedHashMap (1 index) and HashBasedTable (2 indices) do in jvm-land. Admittedly, there isn't a convenience API to convert a given collection/vector to a map/table (but hey, if you're like me and write a lot of golang, then the current jvm-land status quo is already heaven!).
That's probably why such a library doesn't exist. Sorting & search is good enough for one field. Full scans aren't that bad for small/medium datasets. And databases aren't that difficult to integrate into an application.
Also, pandas exists in the Python world.
Pandas is also very powerful, but every time I go to use it I feel I have to learn the API all over again.
I think Prolog (another commenter pointed it out) and Datalog do this to some extent.
If you didn't care about disk space, and the DB could build indexes asynchronously without serving stale reads, then an automatic index system could make sense.
Due to the multivariable nature of this problem, ML is the right approach to solve it, and the ability to "prove" that it will work is also key.
My issue is that you cannot (easily) delete the automatically generated indices, should you want to do that for whatever reason. Of course, Oracle will try to only enable the indices that will lead to actual performance gains, but when it comes to removing the automatically created ones altogether, the best that you can do is:
DBMS_AUTO_INDEX.CONFIGURE ('AUTO_INDEX_RETENTION_FOR_AUTO', '1');
The problem is that you can't easily get rid of all the automatically generated ones so that you could test this migration against the same environment (basically create those same indices through your own SQL).
Here's an excellent interview with the creator: https://corecursive.com/030-rethinking-databases-with-jon-gj...
HypoPG is a PostgreSQL extension adding support for hypothetical indexes.
A hypothetical -- or virtual -- index is an index that doesn't really exist, and thus doesn't cost CPU, disk, or any other resource to create. They're useful for finding out whether specific indexes can increase performance for problematic queries, since you can see whether PostgreSQL would use these indexes without having to spend resources to create them.
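Roughly how it's used (query shape borrowed from the examples upthread):

  CREATE EXTENSION hypopg;
  SELECT * FROM hypopg_create_index('CREATE INDEX ON Baz (x, y)');
  EXPLAIN SELECT * FROM Baz WHERE x > 42 ORDER BY y DESC LIMIT 10;

A plain EXPLAIN (without ANALYZE) shows whether the planner would pick the hypothetical index, without ever building it.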
With one approach to actually implementing it here: https://www.percona.com/blog/2019/07/22/automatic-index-reco...
https://github.com/ankane/dexter is an autoindexer project that builds on hypopg, and I believe it can be configured to autocreate indices — it will only suggest them by default.
For context, I've personally been working on an automatic indexing system for Postgres for more than a year now, and whilst I think what we have today is pretty useful, there is still work to be done. We've intentionally not yet enabled full automation (i.e. actual automatic creation of the indexes on the production database), because most people I've talked with prefer to review a recommendation and then apply it through their regular migration tooling, after making an assessment.
If you ever want to talk more about this, feel free to send me an email - I've spent a lot of time thinking about this topic :)
I think the enabling technology for this is branching and the ability to "fork" a database AND the workload to prove that adding an index doesn't break things.
The only reasonably ready "Kubernetes" project I saw attempting to solve this was SUSE/Rancher longhorn, and that was still somewhat incubating last time I checked.
The feature would also need to keep track of which indexes are unused and remove them over time, to avoid the massive write amplification that would come from building a new index for every query that crossed the threshold.
I don't know how things are in Google Appengine land now.
SARGability of the queries is also needed to effectively use the indexes.
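For example (made-up table; YEAR() as in MySQL/SQL Server):

  -- Non-sargable: the function wrapped around the column defeats an index on created_at
  SELECT * FROM orders WHERE YEAR(created_at) = 2024;

  -- Sargable rewrite: a bare column with a range predicate can use the index
  SELECT * FROM orders WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';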
What if such a field is an uncompressible 12 MB blob, say an image of an index card?
What if indexing them would waste several TBs of RAM? That is not a good default.
In a more realistic scenario, it is reasonably rare that you will only ever search using the PK and no other constraint, so in a way it's a wasted default index.
Dentistry wants to remain in the dark ages, and so does the database world. The rest of us just want something like the electrical (non-Texas) grid. Just works. When it doesn't, it comes back up, and all is fine.
The electrical grid is kept up by the continuous effort of tens of thousands of employees across the country btw, it's hardly magic. I bet most people would be horrified if they realized how much manual planning is involved in keeping the electrical grid running.