

Introducing HypoPG, hypothetical indexes for PostgreSQL - narfz
http://www.postgresql.org/about/news/1593/

======
ninkendo
So presumably you could do the following on your entire database:

    
    
        for each query Q in my codebase:
            for each possible combination of indexes I for Q:
                create hypothetical index I
                run EXPLAIN on Q
                if the plan for Q uses I, create I for real
    

If EXPLAIN doesn't actually run a query, and creating hypothetical indexes
doesn't actually write anything, the above could presumably be done in
milliseconds, no?

Why not take this one step further and just enable an autoindexing mode in
postgres using the query history as a heuristic?

~~~
developer1
For a typical larger project, you'd wind up with hundreds of indexes, each
used by a single application query. The optimal set of indexes for a database
rarely covers every column a query touches; it's normal and desirable (for
index size and write performance) to have fewer indexes that only partially
cover your queries. For example, a query with a WHERE clause of "a AND b AND
c" is more often than not better handled by an index on (a) or (a, b) than by
one on (a, b, c). Especially if you have a second query with "a AND b AND d",
you're most likely going to end up with a single index on (a, b) rather than
one on (a, b, c) and another on (a, b, d).
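As a concrete illustration of the prefix argument (table and column names invented), one composite index can serve both WHERE clauses through its leading columns:

```sql
-- One index with a shared prefix...
CREATE INDEX t_a_b_idx ON t (a, b);

-- ...can serve both of these queries: the planner scans the index
-- on the (a, b) prefix and applies the remaining predicate
-- (c or d) as a filter on the fetched rows.
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;
SELECT * FROM t WHERE a = 1 AND b = 2 AND d = 4;
```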

What could be useful is a generated report over all your queries, with
possible suggestions, grouping together queries that share the same prefix
columns, and the cardinality each index would have. Then one could manually
review it for the most significant improvements. I think a fully automated
attempt (even built-in heuristics that modify indexes over time) would
sometimes make a bad decision that could, in theory, take down your
application. Perhaps after a few years of perfecting such a solution it could
become the norm. :)

------
jedberg
Combining this with a bit of scripting could get you pretty far toward
automatically creating indexes. Basically, take a query plan and then create
every possible index hypothetically until you arrive at the optimal plan.

~~~
timdorr
Except PG index usage in the query plan is dependent on stats about the table
(row count, width, etc.). So you would need to continually run it against
production data and have it forever chase the perfect set of indices.

If you did end up making it, it should be named moby_dick.py.

~~~
CaveTech
Or you tune for the best average execution time. Sample your queries and run
them all through each candidate plan. One has to be optimal, unless all the
differences are negligible.

~~~
Jweb_Guru
Even for a single query, the optimal query plan changes over time with the
distribution of data in the table (and in other tables), as well as things
like database load. There literally is not one best query plan, even with
perfect information. Also, keep in mind that an abundance of even fake indices
like this can have adverse effects on planner results by forcing it to
consider more (and therefore, more suboptimal) cases with its limited budget.

~~~
silon3
And that's why people use NoSQL. It's much more predictable in production use.

~~~
Jweb_Guru
Er, no. It isn't. NoSQL solutions have exactly the same problems. Most just
don't _have_ query planners or statistics capable of understanding how to
properly deal with the ever-changing data (and also frequently don't even have
the indexing capabilities required to do much more than a single get or set
efficiently, anyway--these problems tend to come up with larger queries).

------
dunkelheit
Cool! Thankfully Postgres allows creating indexes concurrently (without
blocking writes), but 'hypothetical indexes' seem even more convenient.

What would be even cooler is the ability to see the multiple plans considered
(and rejected) by the planner, with all their associated costs. It often takes
a lot of painful guesswork to understand why a particular, seemingly superior
plan is not used.
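There is no built-in way to dump the rejected plans, but a common workaround is to disable plan node types one at a time and compare the costs the planner reports. The `enable_*` settings are real planner GUCs; note the planner treats them as large cost penalties, not hard bans (table and predicate are invented):

```sql
EXPLAIN SELECT * FROM t WHERE a = 1;   -- the plan the planner chose

SET enable_seqscan = off;              -- heavily penalize sequential scans
EXPLAIN SELECT * FROM t WHERE a = 1;   -- see what the runner-up plan costs

RESET enable_seqscan;                  -- restore normal planning
```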

------
makmanalp
I'd always wondered why query engines don't create such temporary indices,
for example for queries with large subqueries, where creating an index and
then running the query with it is faster in total than waiting for the query
to run without one. Any insights on this? I guess predicting the gain could be
difficult.

~~~
MSM
There shouldn't be a situation where reading the data + creating the index
would be less intensive than just reading the data.

This is from the SQL server perspective so I apologize if all this doesn't
translate 100%. Let's say the column you're filtering on is currently not in
any index. That field then needs to be pulled out of the clustered index's
leaf pages. Since it's unsorted, we'd need to read the entire table: a full
scan.

To build the temporary index we'd also need to read in all of the data, so the
same amount of work from that perspective, but then we'd have to sort it all,
which is very intensive.

If there's a situation where we're doing something analogous to an inner loop
join in SQL Server, but it's doing full scans on each iteration, that's a
problem with the query optimizer.

~~~
btown
Very true. That said, there are almost always situations where reading the
data + creating the index + using it N times would be less intensive than just
reading the data N times, for _very_ small N.

I'd love to see Postgres or MongoDB with an index-suggesting logger - I'd pay
the cost of the logging to have it run on every query, then a background
thread would figure out what the "hot spot" types of queries are, figure out
what indices to create concurrently, and show it to me in a web interface
where I can click a button and create the indices I need concurrently during a
low-traffic period of time.

Does anything out there like this exist?
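The closest built-in piece of this in Postgres is the `pg_stat_statements` extension, which covers the "hot spot" logging half; picking indexes for the results would still be manual. (Assumes the extension is listed in `shared_preload_libraries`; the timing column is named `total_time` before PostgreSQL 13 and `total_exec_time` from 13 on.)

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- The ten queries consuming the most total execution time:
-- good candidates to test against hypothetical indexes.
SELECT query, calls, total_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```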

~~~
MSM
SQL Server has all this information saved (I'm not paid by Microsoft, I
swear). I would suspect something is available for other engines as well.

For example, I can easily grab (what it thinks are) the top 10 most impactful
missing indexes and create them. The big issue is that, well, they can be
really dumb when looking at a macro level. Do I often query a 10 column table
for 8 of the columns? It will likely recommend I index those 8 columns. In a
vacuum that might be a great index for this specific query, but now I'm
basically maintaining two copies of the full table.

Being able to weigh these pros and cons automatically would be a really nice
feat; then DBAs would have to abandon the "It Depends" mantra!
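For reference, the SQL Server report described above comes from the missing-index DMVs. A rough sketch of the usual query follows (column names from memory, so verify against the documentation before relying on it):

```sql
SELECT TOP 10
    d.statement            AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    -- crude priority score: cost saved * impact * how often it was wanted
    s.avg_total_user_cost * s.avg_user_impact * s.user_seeks AS est_impact
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats s
    ON s.group_handle = g.index_group_handle
ORDER BY est_impact DESC;
```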

~~~
pizza234
I think indexing is ultimately a design trade-off; the end effect is a
decision that belongs to a human: what do we make faster, and at the expense
of what?

Probably the best thing a "machine" can do is to give tools to easily perform
a reasonably accurate and extensive analysis, but I think any major RDBMS has
"good enough" tools to do so.

Interestingly, I think it's important to consider who the target of such tools
is: the casual developer? The novice DBA? The expert DBA who works on large
datasets? I think even theoretically advanced tools would help only the first
group; the other two _need_ to know very well what they're working with.

------
michaelmior
This is quite excellent for those of us doing research in physical schema
design. As far as I know, this is the first "what-if" analysis tool integrated
into an open source database. Although if there are others, I'd love to know
about them :)

------
willu
One of my major hang-ups about transitioning away from SQL Server to Postgres
is the productivity that comes from tools like Profiler and Database Engine
Tuning Advisor (which uses hypothetical indexes to find performance
improvements). Can anyone suggest analogous tools for Postgres?

~~~
postila
You should wait a couple of years more.

There is a LOT of effort being made these days (like
[http://www.pgcon.org/2015/schedule/events/809.en.html](http://www.pgcon.org/2015/schedule/events/809.en.html)
for profiling), and eventually we'll see tooling no worse than what commercial
DBMSs have. I hope it will be soon.

