
Our product offers a near-real-time (incremental sync every 3 mins) backup and restore solution for on-premise SQL Server data. We use PostgreSQL to store the backup data in the cloud and offer add-on services on top of the PostgreSQL data (such as reports, analytics, etc). Each customer's data is stored in a separate database.

Initially we used RDS as our PostgreSQL instance while the product was in its pilot phase. RDS cost us $548 for just 2vCPU 16GB 500GB SSD (db.r5.large Multi-AZ). Considering the growing active customer base and the volume of data involved, we found RDS very expensive: it cost us more (> 100%) than the product price we estimated the market could afford (:facepalm:). As per our performance benchmark, db.r5.large can accommodate 250 customers and scales linearly. To reduce the RDS spend, we had to reduce two costs.

1) Reduce the RDS Instance cost / customer - we aggressively optimised our sync flow, and the final benchmark showed that 500 customers can be accommodated on a db.r5.large Instance (50% less RDS Instance spend / customer)

2) Reduce the RDS Storage cost / GB of customer data - we could not find any way to reduce the storage cost. Since the RDS Instance is fully managed by AWS, there is no possibility of data compression.

When we compared the total cost based on our usage, the RDS Instance cost was just 10-20% and the Storage cost 80-90%. So finally we decided to host our PostgreSQL Instance on EC2 with transparent data compression. This is our current configuration and usage metrics.

r5a.xlarge (4vCPU 32GB)

PG Master - Availability Zone 1

PG Slave - Availability Zone 2 (Streaming Replication)


8 x 100GB ZFS RAID0 FileSystem with LZ4 Compression (128KB recordsize)

40GB (wal files) ZFS FileSystem with LZ4 Compression (1MB recordsize)

600GB Compressed Data (3.1TB Uncompressed - x5.18 compression ratio)


2 x r5a.xlarge - 2 x $104.68 = $209.36

2 x 8 x 100GB - 2 x 8 x $11.40 = $182.40

2 x 40GB - 2 x $4.56 = $9.12

Total EC2 Cost = $400.88


If we had to use RDS

db.r5.xlarge Multi-AZ = $834.48

3.5TB Multi-AZ = $917.00

Total RDS Cost = $1751.48


So we have reduced our cost from $1751.48 / month to $400.88 / month, a saving of $1350.60 / month (more than x4 cheaper), by using EC2 instead of RDS. Best of all, we purchased a 3 Year No Upfront Savings Plan, which further reduced our EC2 Instance cost to $105 (50% reduction). RDS doesn't have a No Upfront Reserved plan for 3 Years, and for 1 Year No Upfront we get just a 32% Instance cost reduction.

Apart from the direct EC2 Instance cost and Storage size reduction, the major benefits we indirectly got by migrating to EC2 Instances are:
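
The totals above can be re-derived with a few lines of Python. This is only a sketch: the prices are the monthly figures quoted in this comment, not values looked up from AWS, and the constant names are made up for illustration.

```python
# Monthly prices in USD, copied from the figures quoted in this comment.
EC2_INSTANCE = 104.68   # one r5a.xlarge
EBS_100GB = 11.40       # one 100GB data volume
EBS_40GB = 4.56         # one 40GB WAL volume
RDS_INSTANCE = 834.48   # db.r5.xlarge Multi-AZ
RDS_STORAGE = 917.00    # 3.5TB Multi-AZ storage

NODES = 2               # PG Master + PG Slave
DATA_VOLUMES = 8        # RAID0 members per node

# Each node carries one instance, eight data volumes and one WAL volume.
ec2_total = NODES * (EC2_INSTANCE + DATA_VOLUMES * EBS_100GB + EBS_40GB)
rds_total = RDS_INSTANCE + RDS_STORAGE

savings = rds_total - ec2_total
ratio = rds_total / ec2_total

print(f"EC2 total: ${ec2_total:.2f}")   # EC2 total: $400.88
print(f"RDS total: ${rds_total:.2f}")   # RDS total: $1751.48
print(f"Savings:   ${savings:.2f}")     # Savings:   $1350.60
print(f"Ratio:     x{ratio:.2f}")       # Ratio:     x4.37
```

Note that the absolute saving is the difference between the two totals, while the "x4" figure is their ratio.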

- IO Throughput increased by x5 due to ZFS LZ4 Compression. Importing a 3GB GZIP-compressed file took around 2.5 - 3 hrs on RDS, whereas on EC2 it takes just 30 - 45 min.

- Existing Savings Plan discount automatically applied (50% reduction)

- Ability to migrate to AMD based Instances (r5a.xlarge) - 50% reduction compared to Intel based Instances (r5.xlarge) in Mumbai region. It'll take ages before AMD based Instances are available in RDS.

- Ridiculous EBS Burst Credits by using 8 Volumes in RAID0. Base IOPS - 8 x 300 (100GB) = 2400 IOPS. Burst IOPS Credits - 8 x 3000 = 24000 IOPS :D

- PG Master is used for backup sync write operations and PG Slave is used for reporting and analytics. RDS requires a Read Replica to be created from the existing db.r5.xlarge Multi-AZ Instance for read operations, which would further increase the estimated RDS cost by x1.5

- Planning to migrate to AWS Graviton 2 ARM64 Instances. AMD (r5a.xlarge) and Intel (r5.xlarge) based Instances have hyperthreading enabled, which leaves us with just 2 real cores and 4 threads. But even the basic Graviton 2 Instance, r6g.large, has 2 real cores. So I'm kinda estimating that the basic r6g.large (2vCPU 16GB) Instance can support up to 1000 Active Customers (further 50% EC2 Instance Cost reduction).
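
The EBS burst bullet above can be worked through as a small calculation. This sketch assumes the volumes are gp2 (the comment does not name the volume type), using gp2's documented 3 IOPS per GB baseline, 100 IOPS minimum and 3000 IOPS burst ceiling. The function name is hypothetical, and instance-level EBS limits are not modelled.

```python
# Assumed gp2 characteristics; the volume type is not stated in the comment.
GP2_IOPS_PER_GB = 3     # baseline IOPS earned per GB
GP2_MIN_IOPS = 100      # baseline floor for small volumes
GP2_BURST_IOPS = 3000   # burst ceiling per volume


def raid0_iops(volumes, size_gb):
    """Return (baseline, burst) aggregate IOPS for a RAID0 of equal volumes."""
    base_per_volume = max(GP2_MIN_IOPS, GP2_IOPS_PER_GB * size_gb)
    # A volume whose baseline already exceeds the ceiling gains nothing from burst.
    burst_per_volume = max(base_per_volume, GP2_BURST_IOPS)
    return volumes * base_per_volume, volumes * burst_per_volume


striped = raid0_iops(8, 100)   # the layout described in the comment
single = raid0_iops(1, 800)    # same capacity as one large volume

print(striped)  # (2400, 24000)
print(single)   # (2400, 3000)
```

Under these assumptions the baseline is identical for both layouts, and striping only multiplies the burst ceiling.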

I assume you're using the ZFS filesystem for the transparent compression. What's your opinion/experience on using ZFS on cloud storage? I mean, the EBS disks are already stored redundantly by AWS, and the COW mechanism could lead to a lot of write amplification, negatively impacting the network-attached storage?

(I don't use EBS in my day job, but Azure's disk offerings don't really offer adequate performance when used with any filesystem other than EXT4 in my experience)

There is a small write amplification due to the pg page size being 8KB and the ZFS recordsize being 128KB, but considering the bulk write nature of our workload there is not much impact. Also, the max IO size of EBS is 256KB, which helps us utilise the available IOPS optimally even if there is write amplification. Reducing the ZFS recordsize significantly reduces the compression ratio, so we kept it as is. For an OLTP application, reducing the recordsize would improve latency, but for these bulk operations the current setting is the most suitable.

I haven't used Azure, but based on my raw benchmark and real-world usage, I would say the type of FileSystem doesn't affect the performance of EBS Volumes. Our peak IO usage is 1200 IOPS and 20 MB/s. I would say a similar RDS configuration would have x4 - x10 write amplification due to the data not being compressed.
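
To put rough numbers on the page size vs recordsize point, here is a simplified model, not a measurement. It assumes that changing any part of a record causes the whole record to be rewritten, that the x5.18 compression ratio quoted earlier applies uniformly, and it ignores ZFS metadata and WAL traffic. The function name is hypothetical.

```python
PG_PAGE_KB = 8            # PostgreSQL page size
ZFS_RECORD_KB = 128       # ZFS recordsize on the data pool
COMPRESSION_RATIO = 5.18  # ratio quoted earlier in the thread


def write_amplification(pages_changed, record_kb=ZFS_RECORD_KB,
                        ratio=COMPRESSION_RATIO):
    """Physical KB written per logical KB changed within one record."""
    pages_per_record = record_kb // PG_PAGE_KB
    if not 1 <= pages_changed <= pages_per_record:
        raise ValueError("pages_changed must fit inside one record")
    logical_kb = pages_changed * PG_PAGE_KB
    physical_kb = record_kb / ratio  # whole record rewritten, compressed
    return physical_kb / logical_kb


random_update = write_amplification(1)   # one 8KB page touched per record
bulk_load = write_amplification(16)      # all 16 pages of the record written

print(f"random 8KB update: x{random_update:.2f}")
print(f"bulk load:         x{bulk_load:.2f}")
```

In this model an isolated 8KB update costs roughly x3 in physical writes, while a bulk load that fills whole records writes well under x1, which matches the claim that bulk workloads are barely affected.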

> we have reduced our cost by $1751.48 / month

It doesn't look like a huge win given how much complexity you added, while RDS manages it for you.

I guess $1751 / month isn't a big deal in developed countries. But in India this is a lot. Also, if I include the RDS Read Replica in the total RDS cost, it comes to $2627 / month (~ INR 1.93 Lakhs / month). Here this is equivalent to the salaries of 5 Junior Developers / month.

Based on our current customer base for the backup and restore solution with add-ons, AWS spend is about 12% - 16% of the total product revenue. Our company has 5000+ Active Customers, and its core product offering is different; the Backup and Restore solution is itself an add-on. If we had priced this considerably higher because of the larger RDS spend, it wouldn't be surprising if we got only 10% of the current add-on customers (700+).

Plus I would say this isn't much complexity. Everything is automated using Terraform and Ansible - pg installation, streaming replication setup, ZFS RAID0 setup, etc... Not a single command is executed manually on our EC2 Instances. The only benefit we get from RDS is the failover capability with minimal downtime. But for that, the x4-x5 higher RDS cost isn't worth it for us.

We still use RDS for OLTP and service databases, but it's not affordable for the backup offering.

Aren't you getting killed on inter-AZ bandwidth costs with streaming replication?

This was our experience when we tried it.

Our total Inter-AZ bandwidth usage is about 2TB - 3TB / month, which comes to just $20 - $30 / month. We are planning to introduce SSL with compression between the Master and Slave to further reduce Inter-AZ bandwidth, but this hasn't been taken up yet.
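
For anyone sizing their own replication traffic, the figures above imply a rate of about $0.01 per GB. A minimal sketch, assuming that implied rate and 1TB = 1000GB (both are assumptions read off the numbers in this comment, and the function name is made up):

```python
IMPLIED_RATE_PER_GB = 0.01  # USD; derived from "2TB - 3TB" costing "$20 - $30"
GB_PER_TB = 1000            # decimal terabytes assumed


def inter_az_cost(tb_per_month, rate_per_gb=IMPLIED_RATE_PER_GB):
    """Estimated monthly inter-AZ transfer cost in USD."""
    return tb_per_month * GB_PER_TB * rate_per_gb


for tb in (2, 3):
    print(f"{tb}TB / month -> ${inter_az_cost(tb):.2f}")
# 2TB / month -> $20.00
# 3TB / month -> $30.00
```

Check the current pricing for your region before relying on the rate used here.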

Wow, thanks for your detailed comment. Learned a few things in here for our own EC2 deployment of PG.
