
Running 10M PostgreSQL Indexes in Production - luu
http://heap.engineering/running-10-million-postgresql-indexes-in-production/
======
malisper
Author here. Feel free to ask me any questions.

By the way, the 10 million number in the title is outdated. It's now about 30
million indexes.

Also, I've been publishing one blog post a weekday on my personal blog about
one Postgres feature or component. If you liked this post, you should check it
out: [http://malisper.me](http://malisper.me)

~~~
lalwanivikas
Hey Michael, not directly related to your post, but I wanted to know how a
noob (like myself) can get good at Postgres?

I am a frontend developer who wants to start making (and deploying) full
stack apps. I have heard that Postgres is one of the most reliable relational
databases out there, and I would love to use it in my apps.

Any suggestions on how to get started with Postgres? Or relational dbs in
general? Books/courses/making projects/anything else?

Thanks in advance!

~~~
skrebbel
I'm not Michael, and I'm not sure if this is the answer you want to hear, but
if you're a frontend developer the database is likely going to be the least of
your worries when you start. Making and deploying the backend is going to take
much more of your time than doing Postgres well.

There's a lot of blog posts on HN about bending databases like Postgres to
pretty insane workloads. Heap's workload is a nice example of that. They're
interesting blog posts, but don't let them distract you. Most databases really
do work pretty well by default, if you just use them in the simplest way
possible. Most apps don't have insane workloads, and especially when you start
out they won't.

If you just make a table for every "thing" in your system and link them
together with foreign keys you're pretty much set. Most popular frameworks
(Rails, Django, etc) do this for you without even touching SQL.

This approach is _fine_ for nearly every case. I bet even Heap started like
that. Famously, Twitter started like that as well, and the only reason that
got them in trouble (remember the failwhale?) was because they were growing
faster than nearly any platform ever before. Most sites don't 10x their user
count every half year, and if you do, lucky you. Scaling problems are luxury
problems.

Half of HN is jealous of people who have scaling problems. I sure am. The
other half, well, they won't have scaling problems because they're not
shipping fast enough. Their backends would scale though. They read all the
books on Postgres and doing it right from the start!

~~~
collyw
As someone who uses Django more or less full time, I don't think relying on
the ORM without knowing any SQL is especially good advice.

Django's ORM is very easy to use, but it is not especially powerful, and when
you want to do advanced queries (multiple join conditions, for example) it
often isn't possible. When someone asks me for some data from the database,
it's usually easier for me to go directly to the database than to use the
Django ORM.

Use the ORM to start with, but it's most definitely a good idea to learn
proper SQL.
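As a hypothetical illustration (the schema here is invented), a join with an
extra non-key condition is one line of SQL but has no direct ORM equivalent
in Django without dropping to raw SQL:

```sql
-- Join on the currency AND a validity date range. Django's ORM has no
-- direct way to express the second join condition.
SELECT o.id, o.amount * r.rate AS amount_usd
FROM orders o
JOIN exchange_rates r
  ON r.currency = o.currency
 AND o.created_at BETWEEN r.valid_from AND r.valid_to;
```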

~~~
pas
ORM is very much like hypermonadic supertemplate C+++++ type theory, you can
only play with it if you know what you're doing, but don't want all that
boilerplate (cognitive or in-code unit tests) in your head.

So, I think starting with PHP and raw SQL is a good thing for motivated
learners.

And if you need motivation, try building a WordPress site with themes and
plugins first. And maybe an Android app. And then something cute like an LED
blinker with a Raspberry Pi.

------
ajaimk
Google Cached Page:
[https://webcache.googleusercontent.com/search?q=cache:v7vWZ1...](https://webcache.googleusercontent.com/search?q=cache:v7vWZ1iYHawJ:https://blog.heapanalytics.com/speeding-up-postgresql-queries-with-partial-indexes/+&cd=1&hl=en&ct=clnk&gl=us)

~~~
icelancer
"Running 10M PostgreSQL Indexes in Production"

Also, our website is down due to modest traffic.

~~~
d3ad1ysp0rk
Obviously they are using a hosted wordpress service, which has nothing to do
with their application and the content of the blog.

~~~
user5994461
wordpress.com is free and takes infinite traffic just fine.

------
alt_f4
One kind of significant downside of partial indexes in PostgreSQL is that they
don't work with parameters. For example, if you have a partial index whose
predicate is `username_id = 5`, this query will use it:

    select * from users where username_id = 5;

but this one won't:

    select * from users where username_id = (select 5);

Basically, the query planner won't evaluate the parameter before it decides
what strategy to use, so it can't prove the query matches the index's
predicate.
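Spelled out (table and index names hypothetical), the setup being described
is roughly:

```sql
-- A partial index only covers rows matching its WHERE predicate.
CREATE INDEX users_partial_idx ON users (created_at)
WHERE username_id = 5;

-- With a literal, the planner can prove the query's predicate implies
-- the index's predicate, so the partial index is usable:
EXPLAIN SELECT * FROM users WHERE username_id = 5;

-- With a subquery, the value is unknown at planning time, so the
-- implication can't be proven and the index is skipped:
EXPLAIN SELECT * FROM users WHERE username_id = (SELECT 5);
```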

~~~
sorkin2
> One kind of significant downside of partial indexes in PostgreSQL is that
> they don't work with parameters.

Your example is fundamentally different from parameters sent by a client in a
parameterized query, i.e. $1, $2, etc. Also while your flawed example will
never use a partial index, parameterized queries can and often will do that.
With _prepared statements_ (which are a slightly different concept from
parameterized queries) the devil will lie in the details, but even those don't
really prevent partial indexes from being used since postgres version 9.2,
released years ago.
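A sketch of the prepared-statement case (names hypothetical): since 9.2,
Postgres will initially plan each EXECUTE with the concrete parameter values
(a "custom plan"), which is what lets a partial index still be chosen:

```sql
PREPARE get_user (int) AS
  SELECT * FROM users WHERE username_id = $1;

-- Early executions get a custom plan built with the actual value (5),
-- so a partial index with predicate username_id = 5 can be used; only
-- if the planner later settles on a generic plan does that stop.
EXECUTE get_user(5);
```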

~~~
alt_f4
my `flawed` example is fine for what I'm using it for: demonstrating that PSQL
won't use a partial index in an obviously common use case. Feel free to argue
about the semantics of what is a prepared statement vs parametrized query vs
whatever, but the fact is that the PSQL planner does not evaluate the subquery
select prior to execution nor is it capable to alter its original plan after
execution of the subquery.

This makes partial indexes of EXTREMELY limited use for complex queries which
often need to pass a non-literal or do a join on a partially indexed column.

In fact, it is this EXTREMELY inconvenient limitation that often forces people
to rearchitect large tables around a set of smaller tables with inheritance
and check constraints, so that the planner will do index scans on the smaller
tables, rather than a huge index.

In fact, if partial indexes were properly architected and supported, it would
eliminate 90% of the use cases for partitions in PSQL. Partitions, by the way,
are also a sham in PSQL 10, merely putting some syntactic sugar over table
inheritance and what is basically a table-routing insert/update trigger. The
current planner support for them is a joke / non-existent / worse than roll-
your-own inheritance with check constraints and triggers.
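The inheritance workaround mentioned above looks roughly like this (a sketch
with hypothetical table names; it relies on constraint exclusion, where the
planner skips child tables whose CHECK constraints rule them out):

```sql
-- Parent stays empty; each child carries a CHECK constraint.
CREATE TABLE events (user_id int, payload text);
CREATE TABLE events_user_1 (CHECK (user_id = 1)) INHERITS (events);
CREATE TABLE events_user_2 (CHECK (user_id = 2)) INHERITS (events);

-- Inserts are routed to the right child, typically via a trigger;
-- shown directly here for brevity:
INSERT INTO events_user_1 VALUES (1, 'click');

-- With constraint_exclusion enabled (the default setting "partition"),
-- this query only scans events_user_1:
SELECT * FROM events WHERE user_id = 1;
```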

------
vijayparashar12
Good to know about partial indexes in Postgres, but I have a couple of
questions. As an analytics company your system is write-heavy, so how did you
justify the increased latency of writes? I am pretty sure you have some kind
of queue in front of Postgres to overcome this problem. Secondly, from the
blog and the nature of event-based analytics, this is generally time series
data. What made you stick with Postgres when there are lots of datastores
built specifically to deal with time series data, like Elasticsearch and
InfluxDB? Can you give me some insight into why Postgres?

~~~
macobo
You need a queue in front of your database regardless of write latency.
Otherwise you tie your availability to database availability, as downtime
(even for upgrades) is often unavoidable and network problems are common.

Dan gave a pretty good talk about the high-level details of the how/why of
Postgres a couple of years back:
[https://www.youtube.com/watch?v=NVl9_6J1G60](https://www.youtube.com/watch?v=NVl9_6J1G60).

One reason for pg is that SQL is really powerful for building complex
queries.

------
ComodoHacker
The selected approach ("1 customer data artifact == 1 database metadata
object") doesn't look scalable. Sooner or later (I bet sooner) the metadata
maintenance overhead will outweigh all the benefits. It didn't work with
tables in thousands of projects (poor man's multi-tenancy); there's no reason
it'll work with indexes.

Treat all your metadata as data at the DB level; your DB metadata is one
level higher.

------
shubb
Clearly SiteGround is not quite so scalable as Heap...

~~~
AustinG08
siteground shared hosting - you get capped at X number of executions per 2
hour / 24 hour period / month. If reached, they shut you down until your quota
refreshes at the end of the month or if you contact them to get your account
reinstated. On the plus side, their tech support is helpful and hands-on. The
OP should probably upgrade their hosting plan.

------
RmDen
Interesting that you named it Heap, in SQL Server land a Heap is a table which
does not have a clustered index

------
Dharani
select * from sign

------
ohnoesjmr
Why not simply full indexes?

~~~
kornish
As I understand it, the types of analysis they want to do actually aren't
super amenable to indexing because they may include wildcards in the middle of
the search string (e.g. `WHERE text ILIKE 'div.button % .login'`). You can
make a btree index on the `text` column, but queries with the above `WHERE`
clause won't use the index.

To index those effectively, you'd need to create a trigram index in Postgres,
which takes up a TON of space. Because 90%+ of rows won't satisfy any of the
predicates defined by the customer, it's probably unnecessary to trigram index
every row.
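For reference, the trigram setup under discussion would look something like
this (a sketch; the table name is hypothetical, the column name comes from
the example above):

```sql
-- pg_trgm's GIN operator class supports LIKE/ILIKE patterns with
-- wildcards anywhere in the string, unlike a plain btree index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX events_text_trgm_idx
    ON events USING gin (text gin_trgm_ops);

-- A middle-of-string wildcard query can now use the index:
SELECT * FROM events WHERE text ILIKE '%div.button%.login%';
```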

Out of interest, malisper, how much space do you think you save by using
partial indices?

edit: removed a statement I couldn't support about taking unnecessary CPU

~~~
malisper
> To index those effectively, you'd need to create a trigram index in
> Postgres, which takes up a TON of space.

I actually ran some experiments using a naive trigram index and found it
currently doesn't work for our use case. Most of our data is highly redundant
(e.g. every CSS hierarchy contains `div`). This causes every query to read a
large portion of the trigram index, which makes the query really slow. I've
heard the developers of pg_trgm improved the performance of this use case in a
recent version of pg_trgm. I haven't rerun my experiment yet because Heap is
currently on Postgres 9.5, and the new version of pg_trgm is only available
for 9.6+.

I believe partial indexes collectively account for ~5% of our disk storage. A
trigram index alone would cause roughly a 30% increase in our disk usage. To
eliminate partial indexes, we would also need full indexes on other commonly
queried fields, such as the URL or the text of what was clicked on. Each of
these indexes adds a solid amount of space.

