Cost - Our primary data store has more than 1 petabyte of raw data stored across dozens of Postgres instances. At that volume, RDS is too expensive for us. The cost of an instance on RDS is more than twice the cost on EC2. For example, an on-demand r4.8xl instance on EC2 costs $2.13 an hour, while an RDS r4.8xl costs $4.80 an hour.
Performance - The only kind of disk available on RDS is EBS. EBS is slow compared to the NVMe the i3s provide. We used to use r3s with EBS and got a major speedup when we switched to i3s. As a side note, the cost of an i3 is also less than the cost of an r3 with an equivalent amount of EBS.
Configuration - By using EC2 we can configure our machines in ways we wouldn't be able to if we used RDS. For example, we run ZFS on our EC2 instances which compresses our data by 2x. By compressing our data, we get a major cost saving and a major performance boost at the same time! There isn't an easy way to compress your data if you use RDS.
Introspection - There are times when we've needed to debug performance problems with Postgres and EXPLAIN ANALYZE wasn't enough. A good example: we used flame graphs to see what Postgres was spending CPU on, and then made a small change that resulted in a 10x improvement to ingestion throughput. If you are curious, I wrote a blog post on this investigation: https://heapanalytics.com/blog/engineering/basic-performance...
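To put the cost point in numbers, here is a quick back-of-the-envelope sketch using the two hourly prices quoted above. The 730 hours per month figure and the function names are my own assumptions for illustration, and on-demand prices change over time.

```python
# Hourly on-demand prices quoted in the comment above (USD).
EC2_R4_8XL_HOURLY = 2.13
RDS_R4_8XL_HOURLY = 4.80

# Assumption: an average month is about 730 hours.
HOURS_PER_MONTH = 730


def monthly_cost(hourly_price, instances=1):
    """On-demand cost of running `instances` machines for one month."""
    return hourly_price * HOURS_PER_MONTH * instances


def rds_premium(ec2_hourly, rds_hourly):
    """How many times more expensive RDS is than EC2 for the same instance type."""
    return rds_hourly / ec2_hourly


if __name__ == "__main__":
    ratio = rds_premium(EC2_R4_8XL_HOURLY, RDS_R4_8XL_HOURLY)
    print(f"RDS / EC2 price ratio: {ratio:.2f}x")
    print(f"EC2 per instance-month: ${monthly_cost(EC2_R4_8XL_HOURLY):,.2f}")
    print(f"RDS per instance-month: ${monthly_cost(RDS_R4_8XL_HOURLY):,.2f}")
```

Multiply the per-instance difference by dozens of instances and the gap adds up quickly.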
You can also get bare metal i3 instances by launching the "i3.metal" instance type. You don't need to wait for the Nitro hypervisor; you can go with no hypervisor at all.
Since you work at Amazon, do you have a sense of how big of a difference there is in performance between i3 and i3.metal for database workloads like Postgres?
As for stability, there have been two major sources of instability with ZFS:
The first issue was with the default value of arc_shrink_shift. By default, ZFS will evict ~1% of ARC, the in-memory file cache, to disk at a time. Our machines have several hundred gigs of ARC, so ZFS was evicting several gigs of data to disk at a time. This was causing our machines to frequently become unresponsive for several seconds.
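For a rough sense of the arithmetic, here is a small sketch. It assumes the shift works as a power-of-two divisor (evict roughly ARC size / 2^shift per step) and that the default is 7, which is where the ~1% comes from. The 300 GiB ARC size and the alternative shift of 11 are made-up illustrative values, not our actual settings.

```python
GIB = 1024 ** 3


def arc_shrink_bytes(arc_size_bytes, shrink_shift):
    """Approximate bytes evicted per shrink step: arc_size / 2**shrink_shift."""
    return arc_size_bytes >> shrink_shift


if __name__ == "__main__":
    arc_size = 300 * GIB  # hypothetical ARC size ("several hundred gigs")

    # Assumed default shift of 7 (1/128 of ARC) versus a larger,
    # purely illustrative shift of 11 (1/2048 of ARC).
    for shift in (7, 11):
        evicted_gib = arc_shrink_bytes(arc_size, shift) / GIB
        print(f"shift={shift}: ~{evicted_gib:.2f} GiB evicted per step")
```

The takeaway is that each increment of the shift halves the amount evicted per step, so a larger value trades a few big stalls for many smaller evictions.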
The other issue is that, for some reason, ZFS will lock up for long periods of time if we delete several hundred gigs of data at once. We haven't been able to identify a root cause of the problem. So far we've worked around it by adding a sleep in between data deletions.
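The general shape of that workaround looks something like the sketch below. This is not our actual code: it deletes plain files as a stand-in for however the data really gets removed, and the function name, batch size, and pause length are all hypothetical.

```python
import os
import time


def delete_with_pauses(paths, pause_seconds=5.0, batch_bytes=10 * 1024 ** 3,
                       sleep=time.sleep):
    """Delete files one by one, pausing after roughly `batch_bytes` are freed.

    Returns the number of pauses taken. `sleep` is injectable so the
    function can be exercised without actually waiting.
    """
    freed_since_pause = 0
    pauses = 0
    for path in paths:
        size = os.path.getsize(path)
        os.remove(path)
        freed_since_pause += size
        if freed_since_pause >= batch_bytes:
            # Give the filesystem time to catch up before freeing more space.
            sleep(pause_seconds)
            pauses += 1
            freed_since_pause = 0
    return pauses
```

The batch size and pause are knobs you would have to tune by watching how the machine behaves; there is nothing principled about the defaults here.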
Other than these problems, ZFS has worked pretty well for us.
How do you manage this?
Also, how frequently do i3 instances fail?
Over the course of a month, we usually have about one machine fail.