
How to Scale PostgreSQL on AWS: Learnings from Citus Cloud - twakefield
https://www.citusdata.com/blog/2017/03/10/how-to-scale-postgresql-on-aws/
======
simonw
Citus are doing a fantastic job on content marketing. Every single piece they
publish on [https://www.citusdata.com/blog/](https://www.citusdata.com/blog/)
is a case-study in how to write content (and headlines) that appeal to the
kinds of developers their product targets.

"How to Scale PostgreSQL on AWS: Learnings from Citus Cloud" \- seriously, how
am I as a PostgreSQL-liking developer who cares about scalability NOT going to
click through to that article?

~~~
craigkerstiens
Simon, thanks so much for the kind words. It's nice to know that the content
is useful. In general we're focused on sharing both content about how to use
Citus and what it's good at, and also good and interesting engineering
content, as we want to add value to our users and the broader community. It's
great to hear that we're doing alright from your perspective on that front.

~~~
Artemis2
Really appreciate the quality of the content; there is some real value to it.

------
kornish
Citus Cloud is perhaps most exciting to me because it has tremendous momentum: as
the combined product of deep technical expertise meeting top-flight open
source software meeting tons of end user experience, it's quickly outpacing
platforms which are locked-in anachronisms. Take Redshift: Postgres 8.4? After
you've used some of the features in 9.6, it's hard to go back. It'd be
interesting to see some numbers around Citus Cloud's battle-tested
deployments.

As a side note, these blog posts on high-level techniques and open source
tools (e.g. PgBouncer, wal-e) are useful for anyone considering deploying an
on-prem version of Citus as part of a product – thanks, Ozgun!

Usual disclaimers apply: not an employee, but big fan of the team and
technology and it's great to see them gaining well-deserved mindshare.

------
pjungwir
I saw the section on EBS, but it didn't offer much advice. Getting good
performance on networked storage is the biggest challenge to me. The last time
I asked about that here [1], I got this answer:

    
    
        nasalgoat 161 days ago [-]
    
        The secret to EBS is to use General SSD,
        not Provisioned, but use a RAID stripe.
    
        The reason this works is because IOPS are provisioned
        per EBS drive and by the size of the drive. So a RAID0 
        stripe of, say, ten General SSD drives will outperform 
        the more expensive PIOPS single drive.
    

That sounds like a great approach, although I haven't had time to try it out
yet. I'm curious if anyone else has done anything like that.

[1]
[https://news.ycombinator.com/item?id=12609172](https://news.ycombinator.com/item?id=12609172)
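
For reference, assembling such a stripe is only a few commands - a rough
sketch I haven't run myself (device names, volume count, and mount point are
all illustrative):

    # stripe ten attached gp2 volumes into one RAID0 array
    mdadm --create /dev/md0 --level=0 --raid-devices=10 /dev/xvd[f-o]
    mkfs.ext4 /dev/md0
    mount -o noatime /dev/md0 /var/lib/postgresql

One caveat worth noting: with RAID0 a single volume failure loses the whole
array, so this leans entirely on replication and backups for durability.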

~~~
kinghajj
I'm working on an approach using ZFS on K8s. The idea is to write a custom
"flexvolume" driver that takes a set of EBS volumes and options, attaches
them, creates a zpool, sets options, etc. In tandem, I'm adding to each node
host a systemd service to create dm-crypt+LVM devices atop each NVMe drive
(as in the i3 instance types). The flexvolume driver will support adding SLOG
and/or L2ARC devices by provisioning LVM volumes from those.

Something I'm still unclear on is whether EBS does any data checksumming, in
which case I could disable ZFS's to spare CPU cycles (but keep compression, of
course).
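
For the curious, the pool setup I have in mind for the driver looks roughly
like this (a sketch only - device paths, pool name, and property choices are
just what I'm considering, not final):

    # stripe the attached EBS volumes into one pool
    zpool create -o ashift=12 tank /dev/xvdf /dev/xvdg /dev/xvdh
    zfs set compression=lz4 recordsize=8k tank  # 8k to match Postgres pages

    # SLOG and L2ARC carved from the local NVMe drives via LVM
    zpool add tank log /dev/vg_nvme/slog
    zpool add tank cache /dev/vg_nvme/l2arc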

~~~
illamint
> I'm adding to each node host a systemd service to create dm-crypt+LVM
> devices atop of each NVMe drive

That sounds cool; can you describe how you're using systemd for this in a bit
more detail? I do this with an Ansible playbook when the machines are
provisioned.

------
manigandham
We use MemSQL and it has the best replication setup process of any relational
database, with one line:

    
    
      REPLICATE DATABASE db_name FROM master_user[:master_password]@master_host[:master_port][/master_db_name]
    

Why is it that in 2017 we still don't have any other database that can come
close to this? Basic replication is very well understood and used everywhere,
but it seems like database creators just don't understand what should be
prioritized.

~~~
anarazel
The stuff in postgres 10 isn't that much different (for logical replication).
On the master you do

    
    
      CREATE PUBLICATION all_tables FOR ALL TABLES;
      or
      CREATE PUBLICATION important_tables FOR TABLE transactions, users;
    

and then on the node you want to replicate to

    
    
      CREATE SUBSCRIPTION backup_important CONNECTION 'dbname=foo host=bar user=repuser' PUBLICATION important_tables;

~~~
manigandham
What about editing the postgresql.conf, pg_hba.conf, recovery.conf files,
setting up archiving and doing the first base backup transfer? That's the
actual hard part.

Logical replication doesn't replicate DDL commands either, from what I know so
far, so that's another major problem.

~~~
anarazel
> What about editing the postgresql.conf, pg_hba.conf, recovery.conf files,
> setting up archiving and doing the first base backup transfer? That's the
> actual hard part.

I talked about logical replication here... But the defaults for those are
going to be a lot saner in 10, too. We're not going to expose postgres to the
world by default, so you'll have to change listen_addresses - but I think
defaulting anything else to open would be a really bad deal security-wise.

> setting up archiving

You don't really need that anymore in the more basic cases. Unless you
want/need point-in-time recovery to older states, replication slots make that
a lot easier.

> doing the first base backup transfer?
    
    
        psql -h master -c "SELECT pg_create_physical_replication_slot('my_standby');"
        pg_basebackup --write-recovery-conf --slot my_standby -D targetdirectory
    
    

I'm not saying things are perfect, but they're clearly improving. And some of
the complexity (e.g. not allowing remote connections by default) is a pretty
basic security vs. simplicity tradeoff.

~~~
manigandham
Default remote access off is perfectly fine, we don't need the mongo style
security problems.

However, the rest just seems to be an overbuilt mess. Why not start simple
with barely any options and work up from there, instead of trying to build in
so many different configurations and features?

90% of use cases would be satisfied with a simple async replica on a different
server started with any user that has read access to that database and a
single command, along with the ability to become a standalone master anytime.

Redis (yes, it's a different database) has another nice and easy setup with a
simple "slaveof" command.

------
cromulent
I was looking for the "I want my database to be performant under high random
load" question. PIOPS can hurt.

Anyone have any experience running PostgreSQL on the new i3 instances?

~~~
kinghajj
[https://news.ycombinator.com/item?id=13842222](https://news.ycombinator.com/item?id=13842222)

Not yet, but you're not the only one with that idea!

------
agentgt
I have mentioned this on some previous posted articles but we are really happy
users of both citus and pipelinedb.

Check out pipelinedb if you are a Postgres fan (obviously it is for a
different use case than Citus).

The only thing I don't like about pipeline is that it currently is a fork and
not an extension but that is supposed to change.

Consequently we syndicate to Citus and PipelineDB through RabbitMQ and Kafka.

We use Google Cloud as well. I'm contemplating writing a post on what we have
learned (and not :)) but I don't think I could ever match the quality of this
article.

And yes, invariably someone will mention that MemSQL does both, but it is
proprietary and not Postgres. I probably should have spent more time
investigating it, though (and eventually will).

------
jacobscott
Does Citus (Cloud?) have features that offer better high availability and
failover functionality than what RDS provides? Managed Patroni and packaged
workflows for zero-downtime failover would be quite interesting, but I don't
see anything like that mentioned on
[https://www.citusdata.com/product/cloud](https://www.citusdata.com/product/cloud).

~~~
craigkerstiens
We don't use Patroni or any of the other off the shelf items. We rolled our
own primarily from our years of experience on Heroku Postgres. We're actually
working on a detailed post on how HA and disaster recovery works for Citus
Cloud, though the core mechanism powering it all under the covers is our state
machine. You can read a bit about how it works in this post:
[https://www.citusdata.com/blog/2016/08/12/state-machines-
to-...](https://www.citusdata.com/blog/2016/08/12/state-machines-to-run-
databases/)

~~~
jacobscott
Thanks! I look forward to reading about implementation details.

It does seem like documentation of Citus Cloud HA and disaster recovery
_behavior_ is a bit light, compared to e.g.

\-
[http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concep...](http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html)

\- [https://devcenter.heroku.com/articles/heroku-postgres-
ha](https://devcenter.heroku.com/articles/heroku-postgres-ha)

"In addition to continuous protection which is explained above, high
availability is available if your application requires less exposure to
downtime. We provision stand-bys if you select high availability at
provisioning time. This can be for your primary node, or for your distributed
nodes."
[https://docs.citusdata.com/en/v6.1/cloud/features.html](https://docs.citusdata.com/en/v6.1/cloud/features.html)

Please let me know if I've missed any resources on this topic!

~~~
craigkerstiens
Yep, our docs are definitely behind the product here, we'll be working to
improve. FWIW it would be the closest to the Heroku Postgres HA as it's the
same person that wrote both systems.

------
forgotpwtomain
Why is it seemingly impossible to read a technical blog post on a company
blog without some seven-year-old-humor-type meme mixed in?

------
hayd
I wonder how Postgres Aurora will fare against Citus... that's what we're
considering migrating to in the next year or so.

~~~
dhd415
Aurora PostgreSQL and Citus are pretty different. The big draw of the former
is that it provides PostgreSQL as a managed service on top of a cloud-
optimized flash storage layer that facilitates high availability, durability,
and low-latency read replicas for PostgreSQL workloads that fit on a single
machine. The latter provides a product (or managed service in the form of
Citus Cloud) that performs transparent sharding that allows for scaling
relational workloads beyond the limits of a single machine. Most workloads
will favor one or the other of those scenarios pretty clearly, either by not
stressing the capabilities of a single machine or by being easily shard-able.

~~~
MichaelGlass
Yeah, Aurora is great for scaling already-fast reads but if you wanna scale
writes or get large indices in memory, it's not going to help you.

------
jordanthoms
Any plans to take Citus in more of a data-warehousey, complex-queries
direction over time? We are starting to hit the limits of Postgres 9.6 and
would like to move to a columnar store, but Redshift is hosted-only, Teradata
looks expensive, and Greenplum looks old.

~~~
mindprince
Try ClickHouse if you need a columnar DBMS:
[https://github.com/yandex/ClickHouse](https://github.com/yandex/ClickHouse)

We started using it recently and it's been amazing.

~~~
jordanthoms
That looks like a great suggestion, thanks! It looks like we'd need to break
out our JSON data into columns (our events table is basically just an event
name, time, user id, and then a jsonb column) and rewrite our queries, but it
seems more suitable than anything else I've seen so far.

We've been getting good results with some tweaks to our Postgres indexing
(partial indexes on event names make a huge difference), and that'll be even
more practical when we can move to logical replication, so we'll be sticking
with that for now.
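
For anyone curious, the partial-index trick looks something like this (table
and column names are just illustrative of the shape described above):

    -- a partial index only covers rows matching the predicate,
    -- so it stays small and hot for queries on that one event name
    CREATE INDEX events_signup_user_time_idx
        ON events (user_id, occurred_at)
        WHERE event_name = 'signup';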

------
jaequery
I feel DB hosting is such an underrated field right now. In terms of scaling,
everything is pretty easy to scale except databases. I would love to see more
services like this.

------
BIackSwan
Aren't most of these use cases already offered/handled by Amazon RDS? Maybe
not transparent sharding - but otherwise everything else?

~~~
kod
So basically everything except horizontal scaling?

Seems like a pretty big difference...

------
LogicX
Why does the community link on your pricing page lead to a 404?

~~~
craigkerstiens
Thanks for reporting, looks like a place we missed updating for our new
tutorial. Should be fixed now.

------
marknadal
1\. I’d like my PostgreSQL database to be Highly Available

Highlight: "The first is the complexity associated with it: it takes twelve
steps to setup streaming replication ... open source solutions such as
Governor and Patroni aim to do just that. That said, this integration again
comes with a complexity cost."

I cannot believe it is 2017 and streaming replication is still considered
complex. I have spent the last half decade+ of my life trying to make this
simple; here is a demo:
[https://youtu.be/-i-11T5ZI9o](https://youtu.be/-i-11T5ZI9o)

2\. I’d like my application to not worry about failovers

Highlight: "most PostgreSQL clients don’t have a mechanism to automatically
retry different endpoints in case of a failure."

Master-Slave systems are not conducive to failover (determining a new Master
involves its own locking/election mechanisms). If we have streaming Master-
Master replication by default, you can have some easy automatic failover -
[https://youtu.be/-FN_J3etdvY](https://youtu.be/-FN_J3etdvY) .

4\. I’d like my database to scale horizontally

Highlight: "Deploying a distributed RDBMS into production requires a good
understanding of both relational databases and distributed systems."

We can do a lot of work to improve understanding out there; Kyle Kingsbury
(Aphyr of Jepsen Tests) has done a lot to spread awareness. A couple years ago
I did a tech talk that explains the ideas with stick figures, so that even
laypersons can understand what is going on:
[http://gun.js.org/distributed/matters.html](http://gun.js.org/distributed/matters.html)
.

5\. I’d like automatic backups for disaster recovery

Highlight: "Distributed database backups are even harder."

See the (1) demo; this doesn't have to be hard, it can be easy enough for
frontend web developers IF the system is a streaming Master-Master database to
begin with. On top of that, check out our "backup to S3" prototype where we
scaled to doing 100M+ messages for $10/day (all costs: CPU, disk, S3) here:
[https://www.youtube.com/watch?v=x_WqBuEA7s8](https://www.youtube.com/watch?v=x_WqBuEA7s8)

My goal and argument here is that database vendors keep propagating the
message of "this is hard, so trust us and pay for our systems" - systems that
Aphyr has repeatedly proven to be broken (although, actually, Postgres did
really well; Kyle was recommending it as the best general-purpose database) -
as Craig notes himself: "In fact, I’ve been on calls where we quoted $300K for
the services work, and never heard from that user again."

We need to break these cycles, and I do believe Craig is trying to do that
with these blog posts, which is great. But we have a long way to go (all of
us).

