Running 10M PostgreSQL Indexes in Production (heap.engineering)
263 points by luu on June 13, 2017 | 65 comments

Author here. Feel free to ask me any questions.

By the way, the 10 million number in the title is outdated. It's now about 30 million indexes.

Also, I've been publishing one blog post every weekday on my personal blog, each covering one Postgres feature or component. If you liked this post, you should check it out: http://malisper.me

I have so many questions. I've made a number of internal dashboards for companies over the years, so I've always been curious about how a real analytics company handles some of the challenges where n > 1:

1. How good is Citrus really? It seems almost too good to be true, but this line on their product page gives me pause: "[Citrus is not a good fit for] free form SQL by analysts. Because of the distributed nature of Citus some SQL primitives do not map seamlessly. If you have data scientists or end customers that are looking to run free-form SQL, not controlled by an application, then Citus is not an ideal fit for you." I'm seriously thinking of migrating to them for a project we have.

2. Do you use the cloud or the enterprise pricing for Citrus? If enterprise, how much are we looking at here? I assume it's at least five digits in a setup fee, but enterprise pricing is so hard to predict.

3. Before moving to Citrus what were some Postgres performance tactics that you used other than partial indexes that made a real difference on performance?

4. When did a single Postgres instance on a single server stop scaling? Was it a problem of writes per second or querying? How long did it take to migrate to Citrus?

5. What is your primary application written in? Do you split it off by endpoint and use something higher level for the low throughput endpoints or do you just write everything in, say, Java or Go?

6. How do you deal with cyber attacks? Even normal sites get targeted, but breaking into an analytics company seems like a cybercriminal's dream. How do you handle real vs. fake traffic on your clients' websites? I could imagine spammers exploiting this, or is that not a primary problem?

I feel tempted to ask more, but I'm afraid I'd be exploiting your offer to the point of being a chore.

Craig here from Citus. Happy for any others that have had experience with Citus to chime in, though I'm happy to try to answer these myself to give some insight, while fully admitting it might be a bit partial :)

1. We've got plenty of customers that rely on Citus as their system of record. How good it really is comes down to your use case. We're a good fit for certain transactional workloads that have a natural shard key (this is the case for many multi-tenant apps), or hybrid transaction/analytical apps like Heap. In these cases you're usually powering some end-user-facing dashboard and need subsecond query response across a large dataset, much like Heap. We're less of a fit for traditional data warehousing, and if you only have 10 GB of data, single-node Postgres works just fine.

2. We have customers that come on board for both Enterprise and Cloud. In the past year since we launched our database-as-a-service, we've seen a steeper curve of people ramping up on it. It's, of course, easier to press a button and have a scalable database than it is to set one up and manage it yourself. In general, if you're already running on AWS then Citus Cloud makes sense; if you can't run on AWS we're always happy to try to make Enterprise work for you, not to mention there is the open source version.

4. This really varies by application. Lots of apps don't run into issues until they're at 1-2 TB, and others benefit from parallelism as early as 40-50 GB. That's not to say they couldn't keep throwing more hardware at it, but we've seen some customers migrate at 100 GB and get a 2-3x or greater performance improvement over single-node Postgres.

As the rest are a bit more specific to Heap I'll let them chime in, though happy to answer any more questions from the Citus side where helpful.

It's Citus Data and CitusDB, not Citrus.

I had a look at Citus for an analytics workload (typically full-table-scan queries), comparing it with Greenplum and Impala (which uses HDFS and the Hive metastore, but not much more from Hadoop). I found Citus not very competitive with Greenplum: Greenplum executed table scans quite a bit faster, but more importantly Citus doesn't have column stores without the cstore_fdw extension, which isn't in the box and has to be added manually, and column stores massively increase table scan performance (like, multiple orders of magnitude). Greenplum in turn didn't perform as well as Impala as data scaled up; also, Impala could deal with thousands of columns, whereas Greenplum and Citus inherit Postgres limits. Greenplum is also tied together with ssh and scripting duct tape; operationally it's a lot less flexible than HDFS.

Postgres on its own isn't very good at analytics queries, it's far too slow for similar reasons to why Citus without cstore was too slow for us, only you get even less I/O parallelization.

You should look at memsql.com

Hey Michael, not directly related to your post, but I wanted to know how a noob (like myself) can get good at Postgres?

I am a frontend developer who wants to start making (and deploying) full-stack apps. I have heard that Postgres is one of the most reliable relational databases out there and I would love to use it in my apps.

Any suggestions on how to get started with Postgres? Or relational dbs in general? Books/courses/making projects/anything else?

Thanks in advance!

I'm not Michael, and I'm not sure if this is the answer you want to hear, but if you're a frontend developer the database is likely going to be the least of your worries when you start. Making and deploying the backend is going to take much more of your time than doing Postgres well.

There are a lot of blog posts on HN about bending databases like Postgres to pretty insane workloads. Heap's workload is a nice example of that. They're interesting blog posts, but don't let them distract you. Most databases really do work pretty well by default, if you just use them the simplest way possible. Most apps don't have insane workloads, and especially when you start out they won't.

If you just make a table for every "thing" in your system and link them together with foreign keys you're pretty much set. Most popular frameworks (Rails, Django, etc) do this for you without even touching SQL.
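For the curious, that "one table per thing, linked by foreign keys" shape is only a few lines of SQL. A minimal sketch (shown via Python's sqlite3 so it's self-contained; the same DDL works in Postgres, and all table/column names here are made up):

```python
import sqlite3

# Hypothetical two-table schema: one table per "thing", linked by a
# foreign key, as described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        body    TEXT
    );
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO posts (user_id, body) VALUES (1, 'hello')")

# Foreign keys make the "link them together" part a plain join.
rows = conn.execute("""
    SELECT users.name, posts.body
    FROM posts JOIN users ON posts.user_id = users.id
""").fetchall()
print(rows)  # [('alice', 'hello')]
```

This is essentially what Rails or Django generates under the hood from model definitions.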

This approach is fine for nearly every case. I bet even Heap started like that. Famously, Twitter started like that as well, and the only reason that got them in trouble (remember the failwhale?) was because they were growing faster than nearly any platform ever before. Most sites don't 10x their user count every half year, and if you do, lucky you. Scaling problems are luxury problems.

Half of HN is jealous of people who have scaling problems. I sure am. The other half, well, they won't have scaling problems because they're not shipping fast enough. Their backends would scale though. They read all the books on Postgres and doing it right from the start!

This is exactly right. Install Postgres, add some indices on the columns you're querying, and you're probably good for the next ten years in that respect. Try to focus on making a product people want and making them aware of it instead. No company[1] has ever died because their tech stack wasn't good enough.

[1]: Approximately.

I ran a web agency in Silicon Valley and I encountered numerous clients whose budgets were significantly hampered by having to maintain or scale poorly designed tech stacks. MediaZone was the biggest. They chose to use an in-house, proprietary ASP.net framework for their web platform. They were live-streaming major sports programming back in 06/07, like Wimbledon. They had tons of customers, but their development run rate and slow speed significantly contributed to their parent company shutting them down. It's true that most companies never get to their scale (~200 employees), but if you want to get that big, you have to plan a tech stack early on that can accommodate your aspirations.

Well, how far in advance are you planning? It's easy to over-optimize when you're starting out and you end up all tangled in an overly-complex system that is over-engineered for your needs.

Everything in moderation. You can often run a single master PostgreSQL system into the millions of transactions a second without having to resort to sharding and other complexities.

As someone who uses Django more or less full time, I don't think relying on the ORM without knowing any SQL is especially good advice.

Django's ORM is very easy to use, but it is not especially powerful, and when you want to do advanced queries (multiple join conditions, for example) it often isn't possible. When someone asks me for some data from the database, usually it's easier for me to go directly to the database than to use the Django ORM.

Use the ORM to start with, but it's most definitely a good idea to learn proper SQL.

ORM is very much like hypermonadic supertemplate C+++++ type theory, you can only play with it if you know what you're doing, but don't want all that boilerplate (cognitive or in-code unit tests) in your head.

So, I think starting with PHP and raw SQL is a good thing for motivated learners.

And if you need motivation, try building a WordPress site with themes and plugins first. And maybe an Android app. And then something cute like an LED blinker with a Raspberry Pi.

I learnt a lot of Postgres-ing while working on a project that was a SPA front-end, and then a fairly basic (unto itself) Node application using [Massive.js](https://massive-js.readthedocs.io/en/v2/) which exposes Postgres, its SQL, and its features rather than hiding them.

Then something that I felt taught me a lot was endeavouring to use only a single SQL query per HTTP request. Which meant some silly things like learning CTEs in much more depth (including stuff like doing updates and selects via multiple cte's but within a "single" query); and using more aggregation and window functions to create the JSON responses from the DB itself as much as possible. (I used some PLPGSQL as well, although I tried not to just reach for it every time, hence why I mention it in brackets rather than directly :-p )
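As a flavor of that single-query-per-request style, here's a minimal CTE sketch (SQLite via Python so it runs anywhere; the table and column names are invented, and note that Postgres additionally allows data-modifying statements like UPDATE ... RETURNING inside a WITH, which SQLite does not):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders (customer, total) VALUES
        ('a', 10.0), ('a', 5.0), ('b', 7.5);
""")

# One query answers the whole request: a CTE computes per-customer
# totals, and the outer SELECT filters on the aggregate.
rows = conn.execute("""
    WITH per_customer AS (
        SELECT customer, SUM(total) AS spent
        FROM orders
        GROUP BY customer
    )
    SELECT customer, spent FROM per_customer WHERE spent > 8
""").fetchall()
print(rows)  # [('a', 15.0)]
```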

It wasn't perfect, but it was a fun thing to do and I felt like I learnt a lot more about Postgres (in particular, but SQL/DBs on the whole) as I progressed.

(The reason I mention this is that Massive.js is really nice for Postgres, and as a frontend developer, using Node.js might be reasonably approachable?)

It's nice to see a node app using PG and not just using Mongo by default!

Craig Kerstiens has a phenomenal set of blog posts about a few different Postgres topics to get you thinking: http://www.craigkerstiens.com/content/

It's really worth checking out all his stuff - his explanations are concise and easy to follow: http://www.craigkerstiens.com/categories/postgresql/

(self plug) for learning Postgres-flavoured SQL, take a look at https://pgexercises.com.

Books-wise, Learning SQL is a great start. http://shop.oreilly.com/product/9780596007270.do

The postgres docs themselves are also pretty fantastic.

Maybe try https://postgresweekly.com for good articles and tips

The O'Reilly books are a good reference.

For theory I really enjoyed reading E.F. Codd's papers on the relational model. He was a good writer!

I am also writing a book you might be interested in https://agentultra.com/postgres-book.html where I focus on treating PostgreSQL as an application server. It covers data design, server programming, extensions, and integration with external processes and data stores.

It clicked for me after reading Joe Celko books: https://www.amazon.com/Joe-Celko/e/B000ARBFVQ

Any specific ones you can point out?

One simple way is to just have an idea of what you can do with it. Doesn't mean you have to do all of it, but having an idea of the capabilities is a good place to start.

Outside of that, there's a lot that you'll just pick up through usage. Get comfortable with basic SQL and then everything else in Postgres is a bonus.

I attempted to write a high level capability overview for that exact purpose. :-)


I know Postgres really well because my job is mostly scaling Postgres. I got a good feel for the tools available by reading the Postgres manual. Then I learned when each tool is effective, and the tradeoffs between them, while working.

Did you consider creating new tables (with appropriate, more generic indexes) vs. creating indexes for everything?

You've got to consider that indexes can be viewed as cut-down tables that you can't directly address.

I would have thought that you are going to start leaning heavily on the query optimiser/index updaters, which may well not be designed for this kind of approach. With the caveat that it obviously seems to be working now :-)

One of the nice advantages of partial indexes is that Postgres will automatically fall back to the next best possible plan. If we created separate tables for each event, either we would have to handle the fallback logic ourselves, or the data would be momentarily non-queryable.

I have come up with a schema/set of indexes that we think would allow us to have a fixed set of indexes. The downside is we would have to do a lot of the work Postgres is currently doing ourselves.
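The fallback behavior is easy to see in miniature. SQLite shares Postgres's partial-index syntax, so a hypothetical sketch (all names invented) can show one query that implies the index predicate using the index while another falls back to a scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, type TEXT, time INTEGER);
    CREATE INDEX idx_clicks ON events (time) WHERE type = 'click';
""")

# A query whose WHERE clause implies the index predicate can use it.
plan1 = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events "
    "WHERE type = 'click' AND time > 100"
).fetchall()
uses_index = any("idx_clicks" in row[-1] for row in plan1)

# A query on a different event type cannot, and falls back to a scan,
# with no application-level logic required.
plan2 = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE type = 'submit'"
).fetchall()
falls_back_to_scan = not any("idx_clicks" in row[-1] for row in plan2)
print(uses_index, falls_back_to_scan)
```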

Hey malisper, have you investigated BRIN[1] indexes, or benchmarked them? For naturally ordered data they seem to be pretty efficient.

1. https://www.postgresql.org/docs/9.5/static/brin-intro.html

While our data is ordered, a BRIN index doesn't support our use case. We need to perform a non-zero number of deletions/updates to existing data. This leaves holes in our table which are filled in by new data, which makes a BRIN index unusable since the data is no longer ordered.

I would have thought a single multi-column index would have been more appropriate. Did you consider these?


A single multi-column index would only work for the one specific query in the example. We have many queries that contain filters that are not easy to index such as regular expression filters. Additionally, a multi-column index would be significantly larger than the corresponding partial index.

After the Heap snippet is added to a website, Heap will automatically capture every event that users do on that website. That means every click, pageview, form submission, etc, are all captured and logged into Heap. From there, the owner of the website can use Heap to answer questions about their users.

Would it be possible to filter out only interesting events?

What do you mean by "filter out only interesting events"?

The Heap philosophy is to capture all of the data up front and then make it easy for our customers to specify and query the subsets of the data they care about. That way our customers never have to write any tracking code and they never have to worry about forgetting to track something.

That is a great blog! We recently started to play around with recursive queries in PostgreSQL. We have a gigantic tree with several thousand nodes. With recursive queries, query time was reduced from 30 seconds to 120ms! And with this we also found new ways to improve workflows in our application.
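For readers who haven't tried them, a recursive query over a tree looks like this. A minimal sketch (SQLite via Python; the `nodes` table is hypothetical, and the same WITH RECURSIVE works in Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO nodes VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")

# Walk the whole subtree rooted at node 1 in a single recursive query,
# instead of issuing one query per level of the tree.
rows = conn.execute("""
    WITH RECURSIVE subtree(id) AS (
        SELECT 1
        UNION ALL
        SELECT nodes.id
        FROM nodes JOIN subtree ON nodes.parent_id = subtree.id
    )
    SELECT id FROM subtree ORDER BY id
""").fetchall()
print(rows)  # [(1,), (2,), (3,), (4,)]
```

Replacing a per-node query loop with one recursive query is exactly the kind of change that turns 30 seconds into milliseconds.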

This may be a stupid question but I glanced over the marketing website and didn't really understand how logs are stored.

- Are you storing each log event as a row in Postgres?

- How do you rotate logs, what happens when logs get indefinitely large?

- Can logs be exported to some common format for offline analysis?

We create one row per event logged in Heap.

We currently have a default 3-year retention policy. We will keep your old data for up to three years.

We have a Redshift integration which you can use to export your data from Heap into Redshift.

Great article! Thanks for taking the time to write it up.

What sort of performance impact do these indexes have on writes? I'm guessing PostgreSQL has to evaluate each index predicate for the inserted data to see if the index needs to have an entry added.

> What sort of performance impact do these indexes have on writes? I'm guessing PostgreSQL has to evaluate each index predicate for the inserted data to see if the index needs to have an entry added.

When we first deployed partial indexes as our primary indexing strategy, we did notice significantly higher CPU. At the time our assumption was that evaluating the partial index predicates was the cause of the high CPU. I later wound up investigating the high CPU use and found the time wasn't spent on evaluating the partial index predicates, but on actually parsing them! I recently wrote up another blog post on how I found this out: https://blog.heapanalytics.com/basic-performance-analysis-sa...

I've been reading your blog posts and they're very well done. Kudos!

Please keep writing. I very much enjoy Postgres and also your blog about it.

Does the predicate in your queries have to match the index exactly or can you add more conditions and have it still use the index? (assuming AND not OR)

You can add more conditions and Postgres will use the index to first find the rows that match the partial index condition and then use the additional condition to postfilter the rows.

excellent - thanks for the information!

Just wondering if you looked into using GIN or GiST indexes on the JSON in the table?

Yes, we have. The problem is that a lot of our keys and values are numeric, and GIN/GiST only support equality for JSON fields. Additionally, the GIN/GiST index winds up being significantly larger and is a bit slower than partial indexes.

Your blog looks great, thanks for sharing!

"Running 10M PostgreSQL Indexes in Production"

Also, our website is down due to modest traffic.

Obviously they are using a hosted WordPress service, which has nothing to do with their application or the content of the blog.

wordpress.com is free and takes infinite traffic just fine.

That's actually for a different article, talking about the use of partial indexes.

The current article (Nov 2016) is here: http://webcache.googleusercontent.com/search?q=cache:KOIWNI0...

One kind of significant downside of partial indexes in PostgreSQL is that they don't work with parameters. Like, if you have a partial index on username_id, this will work:

select * from users where username_id = 5;

but this won't use the index:

select * from users where username_id = (select 5);

Basically, the query planner won't evaluate the parameter before it decides what strategy to use.
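The same limitation can be reproduced with SQLite's partial indexes, which use similar implication rules: the planner proves the index predicate from literal terms in the query's WHERE clause, and wrapping the value in a scalar subquery defeats the proof. A sketch (all names invented; the Postgres behavior is analogous):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, type TEXT, time INTEGER);
    CREATE INDEX idx_clicks ON events (time) WHERE type = 'click';
""")

# With a literal, the planner can prove the query's WHERE implies the
# index predicate, so the partial index is usable.
plan_literal = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events "
    "WHERE type = 'click' AND time > 100"
).fetchall()

# Wrapping the same value in a scalar subquery defeats the proof,
# and the partial index is skipped.
plan_subquery = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events "
    "WHERE type = (SELECT 'click') AND time > 100"
).fetchall()

print(any("idx_clicks" in r[-1] for r in plan_literal))
print(any("idx_clicks" in r[-1] for r in plan_subquery))
```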

> One kind of significant downside of partial indexes in PostgreSQL is that they don't work with parameters.

Your example is fundamentally different from parameters sent by a client in a parameterized query, i.e. $1, $2, etc. Also while your flawed example will never use a partial index, parameterized queries can and often will do that. With prepared statements (which are a slightly different concept from parameterized queries) the devil will lie in the details, but even those don't really prevent partial indexes from being used since postgres version 9.2, released years ago.

My `flawed` example is fine for what I'm using it for: demonstrating that PSQL won't use a partial index in an obviously common use case. Feel free to argue about the semantics of what is a prepared statement vs. a parameterized query vs. whatever, but the fact is that the PSQL planner does not evaluate the subquery prior to execution, nor is it capable of altering its original plan after the subquery executes.

This makes partial indexes of EXTREMELY limited use for complex queries which often need to pass a non-literal or do a join on a partially indexed column.

In fact, it is this EXTREMELY inconvenient limitation that often forces people to rearchitect large tables around a set of smaller tables with inheritance and check constraints, so that the planner will do index scans on the smaller tables, rather than a huge index.

In fact, if partial indexes were properly architected and supported, they would eliminate 90% of the use cases for partitions in PSQL. Partitions, by the way, are also a sham in PSQL 10, merely putting some syntactic sugar over table inheritance and what is basically a table-routing insert/update trigger. The current planner support for them is a joke / non-existent / worse than roll-your-own inheritance with check constraints and triggers.

Would it work if you used a join instead? `select * from users inner join (select 5 as username_id) as x using (username_id)`

Good to know about partial indexes in Postgres, but I have a couple of questions. As an analytics company your system is write-heavy; how did you justify increasing the latency of writes? I'm pretty sure you have some kind of queue in front of Postgres to overcome this problem. Secondly, from the blog and the nature of event-based analytics, this is generally time-series data. What made you stick with Postgres when there are lots of datastores built specifically for time-series data, like Elasticsearch and InfluxDB? Can you give me some insight into why Postgres?

You need a queue in front of your database regardless of write latency. Otherwise you tie your availability to database availability, and downtime (even for upgrades) is often unavoidable and network problems are common.
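A toy illustration of the decoupling (an in-memory Python queue plus a writer thread standing in for a real broker like Kafka; everything here is invented for the sketch):

```python
import queue
import sqlite3
import threading

# Events land in an in-memory queue; a writer thread drains them into
# the database. Ingest keeps accepting events even if the writer stalls.
events = queue.Queue()
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE events (payload TEXT)")

def writer():
    while True:
        payload = events.get()
        if payload is None:  # sentinel: stop draining
            break
        conn.execute("INSERT INTO events VALUES (?)", (payload,))
    conn.commit()

t = threading.Thread(target=writer)
t.start()
for p in ("click", "pageview"):
    events.put(p)  # producers return immediately, regardless of DB state
events.put(None)
t.join()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2
```

A real pipeline would use a durable broker so queued events survive a process crash, but the availability argument is the same.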

Dan gave a pretty good talk about the high-level details of how/why postgres a couple of years back: https://www.youtube.com/watch?v=NVl9_6J1G60.

One reason for pg is that SQL is really powerful for building complex queries.

The selected approach ("1 customer data artifact == 1 database metadata object") doesn't look scalable. Sooner or later (I bet sooner) metadata maintenance overhead will outweigh all the benefits. It didn't work with tables in thousands of projects (poor man's multi-tenancy); there's no reason it'll work with indexes.

Treat all your metadata as data at the DB level; your DB metadata is one level higher.

Clearly SiteGround is not quite so scalable as Heap...

SiteGround shared hosting caps you at X executions per 2-hour / 24-hour period / month. If you reach the cap, they shut you down until your quota refreshes at the end of the month, or until you contact them to get your account reinstated. On the plus side, their tech support is helpful and hands-on. OP should probably upgrade their hosting plan.


Interesting that you named it Heap; in SQL Server land, a heap is a table which does not have a clustered index.

select * from sign

Why not simply full indexes?

As I understand it, the types of analysis they want to do actually aren't super amenable to indexing, because they may include wildcards in the middle of the search string (e.g. `WHERE text ILIKE 'div.button % .login'`). You can make a btree index on the `text` column, but queries with the above `WHERE` clause won't use the index.

To index those effectively, you'd need to create a trigram index in Postgres, which takes up a TON of space. Because 90%+ of rows won't satisfy any of the predicates defined by the customer, it's probably unnecessary to trigram index every row.
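The non-anchored-pattern point is easy to demonstrate. In this SQLite sketch (hypothetical table; the Postgres behavior with a btree index plus ILIKE is analogous), a leading-wildcard LIKE never becomes an index range search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, selector TEXT);
    CREATE INDEX idx_selector ON events (selector);
""")

# A pattern with a leading wildcard can match anywhere in the string,
# so it can't be turned into a range over the ordered btree entries.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE selector LIKE '%login%'"
).fetchall()
is_index_search = any("SEARCH" in row[-1] for row in plan)
print(is_index_search)  # False
```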

Out of interest, malisper, how much space do you think you save by using partial indices?

edit: removed a statement I couldn't support about taking unnecessary CPU

> To index those effectively, you'd need to create a trigram index in Postgres, which takes up a TON of space.

I actually ran some experiments on using a naive trigram index and found it currently doesn't work for our use case. Most of our data is highly redundant (e.g. every CSS hierarchy contains `div`). This causes every query to read a large portion of the trigram index, which makes the query really slow. I've heard the developers of pg_trgm improved the performance of this use case in a recent version of pg_trgm. I haven't rerun my experiment yet because Heap is currently on Postgres 9.5, and the new version of pg_trgm is only available for 9.6+.

I believe partial indexes collectively account for ~5% of our disk storage. A trigram index alone would cause about a 30% increase in our disk usage. To eliminate partial indexes, we would also need full indexes on other commonly queried fields, such as the URL or the text of what was clicked on. Each of these indexes adds a solid amount of space.

My manager wrote a blog post explaining partial indexes and some of the advantages of them: https://blog.heapanalytics.com/speeding-up-postgresql-querie...

In that blog post, you store "time" as BIGINT. Does it work better than TIMESTAMP?

It's the same: timestamp is also stored as a 64-bit integer internally. It's nicer to use timestamp, I guess, since you get all the datetime functions.
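Concretely, Postgres stores a timestamp as microseconds since 2000-01-01 in an 8-byte integer, so a bigint epoch column carries the same information. A small arithmetic check (pure Python, no Postgres required):

```python
from datetime import datetime, timedelta, timezone

# Postgres's internal timestamp representation: microseconds since the
# Postgres epoch, 2000-01-01, stored in 8 bytes.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)
t = datetime(2017, 6, 13, tzinfo=timezone.utc)

# Exact integer microseconds via timedelta floor division.
micros = (t - PG_EPOCH) // timedelta(microseconds=1)
print(micros)  # 550627200000000
```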
