The problem isn't too dissimilar from the challenge that tracing profilers try to solve. One key difference, of course, is that dynamically adapting the physical storage strategy to usage patterns is relatively expensive, resource-wise. But not every problem needs to be solved for those users with 1 billion rows. I would argue that most applications — web apps, anyway — have datasets that are large enough to optimize, yet still small enough that a dynamic, self-tuning system could be entirely feasible.
The resource problem can be mitigated in several ways. One is to off-load processing horizontally, so while a master is chugging away, a slave is building a table using a different strategy. Another is to simply perform expensive optimizations during off-peak hours.
Also, I find it interesting that databases still operate in a mode where reads and writes happen in the same physical store. I have an idea for a database that separates the two. You have the transactional, highly consistent, non-horizontally-scalable OLTP-style "transaction store", where a client carefully constructs reads and writes against the absolute truth, but where reads aren't super efficient. Then you have the "query store": a read-only, highly replicated and sharded set of mirrors that uses highly optimized data structures, created on demand based on the topology of the data and usage patterns. In particular, the same data could be indexed in multiple ways (B-tree, compressed columns, etc.) until the profiler finds the best strategy. The mirrors' replication should be tunable, so that some "hot" subsets can be immediately consistent while others use slower batch transfers to reduce lock contention on updates.
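A minimal sketch of that split, with SQLite standing in for both stores and all table/index names made up for illustration: writes go to one "transaction store", reads go to a snapshot mirror that is refreshed in batches and carries its own indexes.

```python
# Toy sketch only: sqlite3 stands in for both stores; table and index
# names (accounts, idx_balance) are invented for the example.
import sqlite3

txn_store = sqlite3.connect(":memory:")    # absolute truth, takes writes
query_store = sqlite3.connect(":memory:")  # read-only mirror; could be many

txn_store.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
txn_store.execute("INSERT INTO accounts VALUES (1, 100.0)")
txn_store.commit()

def refresh_mirror():
    # Batch replication step; a real system would make this incremental
    # and per-shard, with tunable lag for "hot" subsets.
    txn_store.backup(query_store)
    # The mirror is where usage-driven indexes would be built and rebuilt,
    # without ever locking the transaction store.
    query_store.execute("CREATE INDEX IF NOT EXISTS idx_balance ON accounts(balance)")

refresh_mirror()
balance_row = query_store.execute(
    "SELECT balance FROM accounts WHERE id = 1").fetchone()
print(balance_row)
```

A real query store would pick index strategies per mirror based on observed queries; here the index choice is hard-coded just to show where that decision would live.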
Right now, when people build scalable systems, they often apply this design by putting the truth in PostgreSQL or whatever, and then indexing it into Elasticsearch or Solr. But there are huge benefits to combining the two into a single, unified system.
And if we're talking about a serious need for indexes, then we're almost certainly talking about subtle tradeoffs that are unlikely to be handled the way we hope by algorithms optimizing aggregate metrics (e.g. average/median query time, cache efficiency, etc.). Consider complicated reporting queries that run rarely but take a while: precisely because they run rarely, they barely register in the database's performance metrics. A "smart" DB could decide to drop an index because doing so accelerates a common write operation, and no important queries needed that index anyway. But now your complex reporting queries have all turned into nested full-table scans. Oops...
It's easy to lock an optimization plan in place. It's absolutely trivial, so there's no reason to be afraid of your DB reorganizing itself. Just tell it not to if you are satisfied with how it runs.
The problem is that people need a ton of expertise and awareness when they restructure their data manually. Did they build the right indexes? Well... the DB doesn't know. It just builds what you tell it to. Yet it could know how to optimize access, if only you gave it example queries rather than directives over its internal structure.
This is the advantage of using declarative language in lieu of imperative language. The DB is supposed to be a data manager. You should say "I want to run these queries, and I have this much memory for you" and it should know how to get your results as fast as possible.
There are also maintenance views that can be queried to see whether SQL Server has detected any missing indexes, or whether any existing indexes are going unused - but again, these have to be taken with a grain of salt, as the stats get reset each time those indexes are rebuilt.
> Did they build the right indexes? Well... the DB doesn't know.
The DB does in fact know - looking at the query plan for a given query will show any obvious missing indexes - but having the DB go ahead and build indexes by default for every query is dangerous. When you go through 10 iterations of an ad-hoc query, refining it down to what you actually want, how's the DB to know that the first 9 don't count? Etc.
> You should say "I want to run these queries, and I have this much memory for you" and it should know how to get your results as fast as possible.
That is in essence what the query optimizer attempts to do, though it has breaking points where it will settle for 'good enough' because optimization of the query itself becomes too expensive. Sometimes people just write bad queries.
Because honestly, the only reason any user cares that an index exists is performance. It should be an implementation detail. Something better than an index might come along, and we'll all be stuck running Victorian-age schemas because we've written "create index" all over the place.
What about different compression schemes for different sets of "column" values - with the idea that you execute against the compressed values, rarely ever decompressing. Even column store systems that allow you to specify a compression scheme aren't adaptive. Why the hell not? This is almost 2020.
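A tiny sketch of the idea (pure Python, purely illustrative): with dictionary encoding, the predicate is translated into a code once, and then evaluated directly against the compressed column without ever decompressing a value.

```python
# Toy dictionary encoding of a "column" of strings. All names are
# invented for the example; a real column store would pack the codes
# into bit-width-minimal arrays.
raw = ["US", "EU", "EU", "APAC", "US", "EU"]

# Build the dictionary (value -> small int code) and the encoded column.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(raw))}
encoded = [dictionary[v] for v in raw]

def count_eq(col, dictionary, value):
    code = dictionary.get(value)            # translate the predicate once
    return sum(1 for c in col if c == code)  # compare ints, never strings

print(count_eq(encoded, dictionary, "EU"))
```

An adaptive system could swap this scheme for run-length or delta encoding per column, based on the same kind of profiling the parent comments describe.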
I assume you mean once you process/aggregate a billion records. Having a billion records in a single table on a single, non-distributed system is not an issue at all.
Uhh, a billion-row table you can handle on a laptop easily. Probably in SQLite.
Yeahhh.... no. I remember running a 250-300,000,000 row Postgres table of Wikipedia links on my 2010 Dell XPS and it functioned fine, if a bit slow (it had a huge index on an array column). Good machine, but ancient now.
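A scaled-down sketch of why big single tables are fine on modest hardware: as long as the lookup hits an index, the work per query is logarithmic in the table size, not linear. (100k rows here for speed; the pattern is the same at a billion, just disk-bound. Table and index names are made up.)

```python
# Illustrative only: SQLite, 100k synthetic "link" rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (src INTEGER, dst INTEGER)")
con.executemany("INSERT INTO links VALUES (?, ?)",
                ((i, i * 7 % 100_000) for i in range(100_000)))
con.execute("CREATE INDEX idx_src ON links(src)")

# A point lookup is an index search, not a scan over all 100k rows.
lookup = "SELECT dst FROM links WHERE src = 4242"
plan = con.execute("EXPLAIN QUERY PLAN " + lookup).fetchone()[3]
row = con.execute(lookup).fetchone()
print(plan, row)
```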
Regarding your point about reads and writes on the same store, you might take a look at the new "HTAP" systems (hybrid transactional and analytic processing). The VectorWise guy has a "new" (probably old by now) project called HyPer.
I think that’s because that only would be truly beneficial once you have huge tables, and you really wouldn’t want your database to suddenly decide “let me store this table order by its first column, as that seems to be faster for the queries I run now” if that choice leads to degraded performance for hours while the database reorganises tables, creates indexes, etc.
And it can get even worse; that decision might make the queries it runs _now_ go from “more than fast enough” to “way more than fast enough”, but also slow down the queries you run infrequently from “will do” to “glacial”.
I think we would be better off with a database that can easily generate a report showing the queries run over the last few months, with suggested schema changes and their estimated impact on performance. SQL Server has the query analyzer, but I think that could be taken much further and made easier to use.
Edit: I see that, while I wrote the above, jws wrote that Digital's RDB had something like that.
It's still quite hard to optimize queries especially if there are numerous JOINs involved. For reference, please check Dr. David DeWitt's PASS Keynote slides http://www.slideshare.net/GraySystemsLab/pass-summit-2010-ke... and his additional papers on database systems: http://gsl.azurewebsites.net/People/dewitt.aspx.
I can't find the video of the keynote online, I believe it's only available to PASS Summit attendees, but if you can find someone with access it's a superb talk. And he didn't even address parallelism or distributed processing as optimizations. Microsoft has already added a new cardinality estimator to SQL Server 2014 and it's still not perfect.
"In 2016 we're still manually creating indexes, and databases still use the exact same physical data structures for everything they store, no matter if they're storing 1,000 rows or 1 billion."
As of SQL Server 2014 there are 3 different physical storage methods: row store, columnstore, and in-memory hash table store (formerly called Hekaton). Each type of storage has completely different optimization methods and is best suited to specific types of data and queries. Dr. DeWitt's page has additional detail on those, and you can watch his PASS Summit Keynote on Hekaton on Youtube: https://youtu.be/aW3-0G-SEj0?t=1336
With all this in mind, self-tuning is going to take quite a while to get to something that provides consistent improvement over detailed analysis and testing. And new technology can completely change how one would approach optimization: https://blogs.msdn.microsoft.com/bobsql/2016/11/08/how-it-wo...
Sigh. In the real world (meaning 99+% of applications) there are a few big sets of data and lots of smaller sets of reference data. The thing that matters for performance is reducing the cardinality of the big stuff as quickly as possible. Which is a huge constraint on the search space.
Sure, in some theoretical sense, join ordering needs an A* algorithm or simulated annealing or some other esoteric approach to search. But in practical sense, it is almost a non-issue.
And here's the thing. Those systems with advanced search and costing for join ordering STILL make incredibly stupid decisions more than 3% of the time. And a bad decision can take hours to days and dominate all the good decisions.
DeWitt made incredible contributions to database architectures. He's a great guy and I'm a huge fan. But that was 30 years ago. Where are today's DeWitts?
He just left Microsoft's Jim Gray Labs 1-2 years ago and is now at MIT. He also led the teams that developed Hekaton and PolyBase for SQL Server in the past 8-10 years. He's still making plenty of advances.
However, I think this (very recent) research paper is a much better peek at how self-tuning might work:
All the mainstream commercial DBs will happily generate you a script with self-tuning advice, you just need to read it and if you agree, run it.
We're talking much more than just index optimizations. Think automatic OLTP vs. OLAP pattern detection and migration of data into columnstores or in-memory tables. Automatic detection of cold vs. hot data and migration to-from expensive fast near storage (SSD) vs. cheap remote storage (S3 like). Automatic resizing of instance size. Materialized views. On-the-fly query rewrite. The field is wide open. And you have the best experts in the field looking at your database.
By contrast, traditional on-premise auto-tuning has not made much progress over the past 10-15 years. There are reasons for this: a small operational dataset (one site only), high risk of regression (difficult to do A/B testing), and scarcity of experts. The best avenue for on-premise, I predict, will be the trickle-down effect of migrating service know-how and technologies into on-premise offerings. We're also going to see the know-how and experience from operating the service push its way up the stack into the app layer (think EntityFramework v. 2018?), helping developers avoid the common pitfalls.
BTW, only 3 days ago Adrian Colyer's morning paper discussion was "Self-driving database management systems".
It wasn't as good as a really good expert, but it was at least a 90% solution.
As I recall it was pretty good at getting a tuple and its one to many associated tuples into the same disk page so you needed fewer disk reads to access data. These were disks that were capable of about 35 seeks per second, so it really mattered.
(The drive I'm thinking of is the RA90. 1.2GB in a half of a 6U or so space. 32kg. 280 watts.)
It's pretty fantastic.
As with any suggested indexes though, they have to be weighed against their effect on other workloads. Additional indexes can have a big impact on inserts / updates / deletes.
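That write-side cost is easy to see in a rough benchmark sketch (timings vary wildly by machine, so the absolute numbers mean nothing; table and index names are made up): every additional index is one more structure each INSERT has to update.

```python
# Illustrative micro-benchmark only; do not read anything into the
# absolute times, only the trend as n_indexes grows.
import sqlite3
import time

def load(n_indexes, rows=50_000):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
    for i in range(n_indexes):  # index each of the first n columns
        con.execute(f"CREATE INDEX idx_{i} ON t({'abc'[i]})")
    t0 = time.perf_counter()
    con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                    ((i, i * 2, i * 3) for i in range(rows)))
    con.commit()
    return time.perf_counter() - t0

print(f"0 indexes: {load(0):.3f}s, 3 indexes: {load(3):.3f}s")
```

On write-heavy tables this is exactly the tradeoff the suggested-index reports can't weigh for you.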
He goes into great detail to explain how exactly SQL indices work and how to leverage this to write queries.
Yep. It still amazes me how queries can go from absolutely great to scanning tables with a row added in the wrong place. WITH RECOMPILE still scares the heck out of me. It can really help when distributions change, but it can also kill your performance with a bad query plan choice.
In addition to the hints shown, you can also force certain indexes. This didn't seem to be as big a problem on SQL Server as on its parent codebase, Sybase ASE. Sybase made some really, really bad choices on big tables.
No amount of optimization will fix a bad schema design.
This is also a field where talking to people about how you built your machines and where the indexes are located can make a world of difference. If you cannot have the whole database on the fastest storage, then at least put the indexes there.
I think that this is a bit of an oversimplification, though I understand the sentiment. Statistics skew is one of the primary reasons for such a dramatic shift in query planning and adding a row can create such a problem if you're already at some sort of skew tipping point and that one row pushes you over the edge.
It would be nice to know more about where the edge is and how to plan to ensure you don't drive your application's/server's performance off the cliff, though.
No, it really isn't. Cache/memory isn't a sliding scale; once you exceed it, you fall off a cliff. Database queries can fall off a cliff if you exceed your memory. I've watched a query that was executing in a couple of seconds drop into many-minutes territory because it had to start paging.
There is a real machine with limited memory running your queries and it has been my experience that warnings are rare.
That's really why I recommend SQL writers study the plan generated. It's the closest thing to studying assembly to learn what the optimizer is thinking.
I still think the old Ingres plan viewer was pretty neat.
As I thought through it, I realized you can get a good sense of statistics skew by getting the counts and distinct counts of each value for your keys. If you have a lot of variation, then you may be able to reasonably expect query plan variations.
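That skew check can be sketched in a few lines (plain Python, invented data): compare the average rows per key value, which is roughly what a naive cardinality estimator assumes, against the maximum, which is what the skewed value actually costs.

```python
# Illustrative skew report over a single key column.
from collections import Counter

def skew_report(values):
    counts = Counter(values)
    total = len(values)
    distinct = len(counts)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "rows": total,
        "distinct": distinct,
        "avg_rows_per_value": total / distinct,  # what a naive estimator assumes
        "max_rows_per_value": top_count,         # what the skewed value costs
        "top_value": top_value,
    }

# 90% of rows share one value: a plan tuned for the ~9-row average will
# badly misestimate queries hitting 'hot'.
r = skew_report(["hot"] * 90 + list(range(10)))
print(r["avg_rows_per_value"], r["max_rows_per_value"])
```

When max diverges far from average like this, that's the tipping point where adding one more row can flip the planner's choice.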
Oracle has a bit more data on their website.