
Ask HN: 120M rows Postgres – how can I speed up queries? - dbnewbie
The table is flat (no relations), but many of the queries we make (including aggregates) take hundreds of milliseconds.

Forgetting the specific queries for a moment (basically all queries on this table are relatively slow): how would you handle such a scenario? Shard the database?
======
rapphil
Have you checked whether you are using the correct settings to extract the
max from your hardware?

[https://pgtune.leopard.in.ua/#/](https://pgtune.leopard.in.ua/#/)
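To give a rough sense of the knobs involved: for a machine with, say, 16 GB of RAM, pgtune tends to suggest settings along these lines (illustrative values only; run the tool against your actual hardware and workload):

```ini
# postgresql.conf -- illustrative values for ~16 GB RAM; use pgtune's output
shared_buffers = 4GB            # ~25% of RAM
effective_cache_size = 12GB     # ~75% of RAM; a hint to the planner, not an allocation
work_mem = 64MB                 # per sort/hash operation, so keep concurrency in mind
maintenance_work_mem = 1GB      # speeds up VACUUM and CREATE INDEX
random_page_cost = 1.1          # lower this on SSDs so index scans look cheaper
```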

Are you monitoring your machine to check if it is not starving on cpu?

Re the queries, we have been using pgBadger to collect metrics about DB usage
and the slowest queries by type. This is helpful as it guides where you
should put your efforts.

[https://github.com/darold/pgbadger/blob/master/README.md](https://github.com/darold/pgbadger/blob/master/README.md)
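The basic workflow is: turn on slow-statement logging, then point pgbadger at the log. The threshold and path below are just examples:

```shell
# in postgresql.conf: log every statement slower than 100 ms (example threshold)
#   log_min_duration_statement = 100

# then generate an HTML report from the log file (example path)
pgbadger /var/log/postgresql/postgresql.log -o report.html
```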

This is very good ref about scaling Postgres.

[https://pyvideo.org/pycon-ca-2017/postgres-at-any-scale.html](https://pyvideo.org/pycon-ca-2017/postgres-at-any-scale.html)

~~~
dbnewbie
Thank you an absolute ton! I’m gonna be watching that scaling video from Pycon
this afternoon!

We definitely have a decent configuration in terms of making the best use of
the hardware for our typical workload. We also have indexes on the fields we
are using for filtering. It’s crazy: I put this question here because I feel
like this is an inevitable thing that everybody just runs into over and over
with scaling, and the only real way out is sharding, so I’m glad to see a
bunch of suggestions here, and I’m going to watch that video as I mentioned.
Thanks again.

------
ezekg
Check out [https://www.pgmustard.com/](https://www.pgmustard.com/) if you
haven't already. It has helped me optimize queries as much as possible when
querying large datasets.
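Tools like this generally work off the JSON form of the query plan, which you can get with something like this (the query itself is made up):

```sql
-- collect a full plan with timing and buffer stats for a plan visualizer
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT city, count(*) FROM orders GROUP BY city;
```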

~~~
dbnewbie
Thanks! That’s awesome. I know roughly how to understand PG query plans, but
it doesn’t ever seem to actually help me figure out a solution. This service
looks like it will help quite a bit.

------
xq3000
There aren't a lot of specific details here, but have you already tried and
exhausted things like analyzing execution plans, index optimizations, index
encoding, value encoding, denormalization, read replicas, and [front] caching?

~~~
dbnewbie
I have not looked into “index encoding”. In fact, I haven’t even heard of
that, thank you for the suggestion!

In terms of execution plan, the query we are doing is relatively basic even
though it includes some aggregation. The aggregation is rule (CASE) based and
very simple. It feels like there is no way to quickly (sub-50ms) retrieve
information from a database once you reach the high tens of millions of rows.
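To give a feel for it, the query is shaped something like this (table and column names invented):

```sql
-- illustrative shape only; the real table and columns differ
SELECT
  sum(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_total,
  count(*)                                              AS n
FROM orders
WHERE created_at >= now() - interval '30 days';
```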

~~~
xq3000
By index encoding I just mean putting “encoded” values into the indexed column
of your table. Something like this: “city:department:sku”.
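Concretely, with a hypothetical table, you put the encoded value in a single indexed text column, and any query that knows the leading components becomes a cheap prefix scan:

```sql
-- hypothetical table; one text column holds the encoded key
-- text_pattern_ops lets the btree serve LIKE 'prefix%' regardless of locale
CREATE INDEX items_key_idx ON items (encoded_key text_pattern_ops);

-- knowing city and department turns the lookup into a prefix scan
SELECT * FROM items WHERE encoded_key LIKE 'NYC:electronics:%';
```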

This technique is a bit advanced, borrowed from hierarchical databases, and
optimizes for _specific_ queries known upfront, so it’s cool but not very
flexible. There is a lot more to making it work. You can watch [1], if
interested.

But I’d also +1 other suggestions here on fine tuning your db engine and just
scaling up the server.

[1] [https://youtu.be/jzeKPKpucS0](https://youtu.be/jzeKPKpucS0)

Disclaimer: I’m with AWS.

~~~
dbnewbie
Interesting, that does sound like an advanced indexing technique, but also
sounds like a really good idea. It reminds me of the old flat file database
formats I read about.

------
gigatexal
Have you thought about materialized views that pre-aggregate your queries and
then updating those views with triggers? I got this idea from a forum so it’s
not mine but it works.

Sharding by month or other bucket of time could help.
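On Postgres 10+ that can just be declarative partitioning; a sketch with made-up names:

```sql
-- made-up schema: range-partition by month so queries prune to one partition
CREATE TABLE events (
    id         bigint,
    created_at timestamptz NOT NULL,
    payload    hstore
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2018_01 PARTITION OF events
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
```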

We have a very similar situation except it’s billions of rows. One benefit is
it’s a bit denormalized in that we store the meat of our data in an hstore
field.

~~~
dbnewbie
Thanks a ton for the help, I will take a look into what we might be able to do
in terms of adding a materialized aggregate view!

~~~
gigatexal
You’ll get the most benefit from pre-aggregating not for one specific query
but in a general way. Sum orders grouped by user in a view, then use triggers
to keep that view updated. That way, getting the totals for a given user is a
simple lookup with all the calculation already done, and you can build up
meta reports from these general aggregates.
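The shape of it, with invented names:

```sql
-- invented names: a general per-user aggregate, queried with a single index hit
CREATE MATERIALIZED VIEW user_order_totals AS
SELECT user_id, count(*) AS order_count, sum(total) AS total_spent
FROM orders
GROUP BY user_id;

CREATE UNIQUE INDEX ON user_order_totals (user_id);

-- lookups no longer touch the 120M-row table
SELECT total_spent FROM user_order_totals WHERE user_id = 42;
```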

~~~
dbnewbie
Ok awesome, and thanks again!

~~~
gigatexal
Just to be super clear: don’t use the triggers to rebuild the view every
time, but merely to add to the summed amount when a matching row comes in. I
wasn’t sure I made that clear. Good luck! And do put up a blog post or
something if it works out.
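One caveat: Postgres materialized views can only be REFRESHed wholesale, so the incremental version in practice is a plain summary table kept current by a trigger. A sketch with invented names:

```sql
-- invented names: keep a summary table current instead of rebuilding it
CREATE TABLE user_order_totals (
  user_id     bigint PRIMARY KEY,
  total_spent numeric NOT NULL DEFAULT 0
);

CREATE FUNCTION bump_user_total() RETURNS trigger AS $$
BEGIN
  -- upsert: add the new order's total to the user's running sum
  INSERT INTO user_order_totals (user_id, total_spent)
  VALUES (NEW.user_id, NEW.total)
  ON CONFLICT (user_id)
  DO UPDATE SET total_spent = user_order_totals.total_spent + NEW.total;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_bump_total
AFTER INSERT ON orders
FOR EACH ROW EXECUTE PROCEDURE bump_user_total();
```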

~~~
dbnewbie
OK that makes sense, and I’ll do a bit of reading on materialized view
functionality before actually implementing anything.

~~~
gigatexal
Again not my idea. It was suggested to me in response to the very problem
you’re having. I’m just so happy to have it as an idea that I didn’t want to
keep it to myself. Here’s the link.
[https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql](https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql)

------
lastofus
From my experience 7+ years ago, I saw an order of magnitude difference
running Postgres on higher end bare metal hardware (think $6k rack mounted
server w/ nice RAID controller) vs a supposed high end AWS VM w/ an EBS
volume. That was testing with spinny disks before SSDs were the norm, but if
you are on a cloud VM, it's probably worth investigating.

~~~
dbnewbie
Thank you for the suggestion. I don’t know for certain that it’s a tenable
solution, because we are using some services that would take some time to
vertically integrate if we were to go to a colo. But when we reach that scale
it will definitely be an effort worth exploring.

------
usgroup
Use cstore_fdw to store your data in columns and marvel at the 10x
performance improvement without indexes.
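Setup is roughly this (the table definition is just an example):

```sql
-- cstore_fdw: columnar storage through a foreign table (example schema)
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE orders_columnar (
    user_id    bigint,
    created_at timestamptz,
    total      numeric
)
SERVER cstore_server
OPTIONS (compression 'pglz');
```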

~~~
dbnewbie
Excellent, thanks for the suggestion. I hadn’t heard of this lib until now.
And, seeing that this is developed by Citus makes it automatically 10x better.

