

Speeding Up PostgreSQL with Partial Indexes - drob
http://blog.heapanalytics.com/speeding-up-postgresql-with-partial-indexes/

======
infogulch
What about one multiple-column index? Using three separate indexes on three
columns is almost never ideal, so I don't know why it was considered in the
first place (unless it was to make a contrived "normal" to compare partial
indexes against).

My guess would be: nearly the same read performance as the partial index, and
a bit less than half the storage of three separate indexes. The write cost
isn't as "free" as with the partial index, but it's a third of the cost of
three separate indexes, and it still allows many more types of queries than
the partial index does.

~~~
drob
Separate single-column indexes can be reused more readily. In this scenario,
you might have a few different types of "high value" events, and you'd want
your indexes to be applicable to more than one of them.

A multi-column index on (a, b, c) can't be used efficiently for a query that
filters on b and c but not a. More detailed documentation here:
[http://www.postgresql.org/docs/9.3/static/indexes-multicolum...](http://www.postgresql.org/docs/9.3/static/indexes-multicolumn.html)
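
A minimal sketch of what I mean (table and column names are made up):

    -- one multi-column index
    CREATE INDEX events_abc_idx ON events (a, b, c);

    -- can use the index: the leading column a is constrained
    SELECT * FROM events WHERE a = 1 AND b = 2;

    -- generally won't use it: a isn't constrained, so at best the whole
    -- index would have to be scanned
    SELECT * FROM events WHERE b = 2 AND c = 3;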

~~~
infogulch
If you have no clue whatsoever about how your table could ever possibly be
queried, then sure, three separate indexes would probably be the best way. But
most tables are not queried on all possible combinations of columns; instead
they have regular query patterns.

Adding N 1-column indexes for N columns on every table is, at best, a lazy
cop-out that stands in for actual data and query analysis.

Edit: to address your example specifically, you'd probably be better off
having two indexes, one on (a, b, c) and another on (b, c), instead of three
separate indexes, if that covers your query patterns.
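
Roughly (reusing the made-up names from above):

    -- serves queries that constrain a (optionally also b and/or c)
    CREATE INDEX events_abc_idx ON events (a, b, c);

    -- serves queries that constrain b (optionally also c) but not a
    CREATE INDEX events_bc_idx ON events (b, c);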

~~~
drob
This is a scenario in which you do know the query set ahead of time, and
there's still a considerable performance benefit to using partial indexes over
any combination of multi-column indexes. The main insight is that the "high
value" events are a tiny slice of the table -- about 0.05% of rows for each
type of event.

Let's say you care about ten different types of "high value" events, which
reference a total of six columns. I'll assume we can cover them with three
multi-column indexes, although it wouldn't be hard to cook up a realistic
scenario in which you'd need more. That means, for each INSERT/UPDATE, you
need to write to three different indexes.

Given that the event definitions are selective, a single partial index
requires a write for only about 0.05% of INSERT/UPDATEs. Ten of these will
cumulatively require one index write on ~0.5% of inserts -- roughly 0.005
index writes per row versus 3, a 600x improvement over the conservative
estimate above. That is, the cost of maintaining the set of multi-column
indexes should be much, much higher than the cost of maintaining the set of
partial indexes.

As I mentioned in the article, the partial index approach also allows a more
flexible set of predicates. What if you want to index only the rows where a
field matches a fixed regex?
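
To make that concrete, one of these partial indexes might look roughly like
this (the event definition and column names are made up, not the ones from the
post):

    -- only the ~0.05% of rows matching the event definition are indexed,
    -- so only those INSERTs/UPDATEs pay for an index write
    CREATE INDEX signup_submit_time_idx
        ON events (time)
        WHERE type = 'submit' AND path ~ '^/signup/';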

Edit: Apparently I can't respond to your response -- does HN have a chain
length limit?

Yes -- for indexing the single event definition used for the profiling here, a
multi-column index would absolutely be preferable to three single-column
indexes. I didn't think to include the option because I assumed it wouldn't
scale to a case in which you have ten different event types that use an
overlapping but not identical set of fields. Definitely would have been a good
idea to include a comment to this effect in the post.

~~~
infogulch
@Your edit. (I don't think there's a nesting-limit, but a minimum delay before
you can reply.)

That makes a bit more sense. I would challenge your phrases "realistic
scenarios," "assumed it wouldn't scale," and "overlapping but not identical
set of fields," but the real problem we're facing in this thread is that all
our scenarios are hypothetical and simplified. The fact is, indexing decisions
are highly dependent on the exact scenario, and since we have different
contexts, the best we can do is banter on about the "most accurate"
hypothetical scenario, which is nonsense.

So I guess my conclusions would be "highly selective partial indexes could
save you a bunch of write io" and "otherwise, It Depends™ because Indexes Are
Hard®". In any case, your article is a decent introduction to partial indexes
in general.

~~~
drob
Yep, looks like a minimum delay. Interesting!

Agreed in general. No one-size-fits-all answers, and schematizing your data
well is going to require case-by-case attention and experimentation for the
foreseeable future.

------
chdir
In the case of Django, the RunSQL operation [1] in migrations makes it very
easy to use partial indexes. There are a couple of examples on Stack Overflow.
Definitely low-hanging fruit and worth the 10-20 minutes to set up.

[1] [https://docs.djangoproject.com/en/dev/ref/migration-operatio...](https://docs.djangoproject.com/en/dev/ref/migration-operations/#runsql)
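
For instance, the SQL you hand to RunSQL (plus a reverse statement so the
migration can be unapplied) can be as small as this (table and index names
made up):

    -- forward SQL for the migration
    CREATE INDEX orders_pending_created_idx
        ON orders (created_at)
        WHERE status = 'pending';

    -- reverse SQL
    DROP INDEX orders_pending_created_idx;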

------
dsugarman
This is a really cool feature I did not know about. Just curious: why do you
use such crazy JSON in Postgres? Why not store the data in a standard
relational way?

~~~
mcorrand
DISCLAIMER: I'm not a strong dba, but I'll try my hand at an answer. Please
let me know if I'm talking nonsense.

It would be useful if the data meets two requirements:

- you know that it won't be queried in a relational way, but

- you don't know how the schema might evolve (beyond the fields you use in your partial index, that is) and want to save yourself the hassle of frequent schema migrations.

~~~
calpaterson
I've been on a project that used Postgres' JSON type as its main data
storage. I wouldn't go down that route again, even now that jsonb is
available. My thoughts:

It doesn't really make sense to say that a query won't be relational. Few
result sets are ever relational - if they include a JOIN, or really any
grouping, sorting, etc., they are a non-relational projection of your data.
That's the whole point of a relational database - relational storage and ad
hoc non-relational views of it.

If you don't need to migrate your data when you change the schema (ie: nothing
has gone live yet), just truncate the entire database every time you change
schema. No need to write migrations and you get to iterate towards a good
relational schema.

If you do need to migrate your data when you change the schema, do not under
any circumstances choose JSON types. Existing migration tools like Flyway,
Alembic, etc only support migrations between relations and you will have to
write your own, custom, half-baked and buggy schema migrations. Migrating even
relational data is hard and migrating non-relational data is much harder. You
won't have the support of postgres' fantastic transactional DDL so it will be
hard to do migrations without downtime. Probably the only approach that will
work is to migrate data as you touch it.

Even if you decide to go for JSON, make sure you understand the limitations of
the JSON type in 9.3 and lower. Equality isn't defined on JSON (you have to
coerce to text first) so lots of basic SQL features won't work - for example
UNION or EXCEPT. 9.4 has JSONB which is a lot better.
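
For instance, on 9.3 something like this fails until you cast (table and
column names are illustrative):

    -- fails on 9.3: there is no equality operator for type json,
    -- so UNION can't deduplicate the rows
    SELECT data FROM events_a
    UNION
    SELECT data FROM events_b;

    -- workaround: compare the text representation instead
    SELECT data::text FROM events_a
    UNION
    SELECT data::text FROM events_b;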

~~~
laurencerowe
I'm also using the Postgres JSON type in 9.3, but my experience is different,
perhaps because I come from the ZODB world and so am not under any illusions
that it is a panacea!

I've ended up with a very generalised schema; all content is in JSON. I've not
had to change the Postgres schema for over a year, and it's easy to upgrade
schemas inline while fetching from the app. We did write our own migration
tools, but they're only a couple of pages of code.

I don't understand your complaint about the lack of transactional DDL when
using JSON. It's still there, though it's probably not as important, since
most of your upgrades will only touch data, giving you transactional upgrades
(should you desire them) in any database.

DB queries of JSON data are definitely slow in 9.3, as the values need to be
parsed on every query, but the lack of JSON equality has not been a problem.
You simply use the double-chevron text extraction operator rather than the
JSON extraction operator in your query: col->'foo'->>'bar' rather than
col->'foo'->'bar'. From there you can cast to any type and join to other
tables.
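
Something along these lines, with made-up table and key names:

    -- ->> returns text, which can be compared directly or cast and joined
    SELECT c.*
    FROM content c
    JOIN users u ON u.id = (c.data->'meta'->>'user_id')::int
    WHERE c.data->>'state' = 'published';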

For us, the big advantage of using a generalized schema has been simple
support for recording versions of data. With over 30 different content types,
doing this with a traditional relational schema would have resulted in maybe
90 tables. Our generic schema has six.

I'm really looking forward to 9.4 as it will make ad-hoc queries across JSONB
really fast using GIN indexes. I'd love to see the option of creating check /
foreign-key constraints on JSON data in the future. Unfortunately it looks
like the proposal for this is stalled as the implementation turned out to be
rather tricky. [http://blog.2ndquadrant.com/postgresql-9-3-development-array...](http://blog.2ndquadrant.com/postgresql-9-3-development-array-element-foreign-keys/)
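
For the curious, the kind of 9.4 setup I'm looking forward to is roughly this
(names made up, and it assumes a jsonb column):

    -- a GIN index over the whole jsonb document
    CREATE INDEX content_data_gin_idx ON content USING GIN (data);

    -- containment queries like this can then use the index
    SELECT * FROM content WHERE data @> '{"state": "published"}';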

------
losvedir
This is a cool feature.

Unfortunately, when I looked into this for our codebase, I found that it
doesn't have full support in Rails 3. You can create the index via a
migration, but Rails won't include it properly in your `schema.rb` file. So
restoring via the schema (as opposed to running through all your migrations
from scratch) or just trying to understand your database by looking at
schema.rb won't give you a full picture.

However, looking at `add_index` in Rails 4, it seems to support it!

edit: one thing I did try was changing the schema file from `schema.rb` to
`schema.sql`, which supposedly fixes it, but I had some issues with that which
I don't remember at the moment.

~~~
sanderjd
There's a simple solution to this:

    config.active_record.schema_format = :sql

In my opinion, this should be the default. I'm really not sure what advantage
the schema DSL in schema.rb has over the ability to express the full schema in
its native tongue.

Edit: Just saw your edit. This[0] may have been the problem you had with using
:sql as the schema format. The counter-intuitive workaround is to output ruby
in production. It seems weird, but doesn't hurt anything.

[0]: [http://stackoverflow.com/questions/17300341/migrate-not-work...](http://stackoverflow.com/questions/17300341/migrate-not-working-on-heroku)

~~~
ikawe
> I'm really not sure what advantage the schema DSL in schema.rb has over the
> ability to express the full schema in its native tongue.

I always end up using the sql format, because sooner or later I do something
remotely non-trivial in postgres. But I'll try to answer your implicit
question...

schema.rb works just fine for a sufficiently simple schema, it's easier to
make sense of in version control, and it's backend-agnostic, so if someone
wants to develop the app on SQLite, that's just fine.

Of course, all these things become a downside when your setup is sufficiently
complex, but if you are not using any interesting database features it's fine.
In particular, I would almost never recommend developing against a different
DB than the one you are deploying on. Sooner or later you'll be bitten by a
behavioral difference.

