
Ask HN: Citus/Postgres as an alternative to NoSQL - logn
Is using the Citus extension to Postgres a sensible and scalable alternative to NoSQL databases? DynamoDB seems really expensive depending on the use case and managing Cassandra seems harder than Postgres.<p>I&#x27;m wondering if my hunches here are valid.<p>Note that Citus MX is in beta and will eventually be open source. It allows fully decentralized nodes (no master node bottleneck).
======
brandur
I'll bite despite not being the most qualified person around to answer the
question :)

First of all. It's well considered to think about alternatives to NoSQL. You
lose _so much_ correctness and resilience by moving to systems like Mongo and
Cassandra that throw ACID out the window. At some point of scale you're going
to need to loosen your data constraints to allow further growth, but most
people aren't actually going to get to that point, and of those left over,
practicing some basic pragmatic data austerity may allow you to avoid
"webscale" databases altogether.

Regarding Dynamo: I've used it on fairly large projects at two companies now.
It _mostly_ works as advertised, but its nature means that you're constantly
fighting the system if you ever need to access data in a way that you hadn't
originally considered. You can scan relatively cheaply along an index, but if
you have any other predicate condition that needs to be added, you'll have to
fall back to seq scanning the table. This is obviously slow, but also costly
because it'll have a huge impact on your provisioned read throughput.

On both projects we eventually had to think about moving off on Dynamo because
of serious troubles with ingestion bottlenecks. I wasn't directly involved in
either project so I have a little trouble remembering the exact details, but
both problems were essentially related to the partitioning balance of incoming
data. You can provision as much write throughput as you want, but if you're
data isn't balanced perfectly by partition key as it comes in, single shards
can become exhausted and put severe pressure on the entire system.

Regarding Citus: I still haven't used it directly, but Postgres is obviously a
great system, and Citus seems to be a nice augmentation for its functionality.
The fact that it can guarantee atomicity, consistency, and isolation within
the bounds of a single shard is potentially _huge_ for allowing you to
maintain incredible robustness even at scale.

I had a casual conversation with a CloudFlare engineer once wherein I asked
them about the dirt, but they didn't give me any and seemed to be fairly
content with the product.

Regarding Cassandra: again, no direct experience. I guess this is where you
eventually need to go if you scale out far enough, but it seems like an
incredible pain to setup, build with, and manage.

------
mslot
(Marco from Citus Data)

It depends on the use-case, but PostgreSQL and Citus can be a sensible
alternative to NoSQL.

RDBMS and NoSQL are on the opposite ends of the scaling spectrum. RDMBS
provide very rich functionality and put few restrictions on your data model,
but require the data to be stored one big lump on a single machine. NoSQL
databases provide minimal functionality and put severe restrictions on your
data model (must be a key->value set), but can distribute the data across many
machines.

Citus sits somewhere in the middle. The ideal use-case for Citus is one in
which your data is kind of lumpy/relational, but can still be distributed by a
particular key. The most notable example is multi-tenancy [1]. If you're
running a (B2B) SaaS application with many customers, your queries are usually
specific to a particular tenant, which means you can probably add a tenant_id
column to all your tables and distribute the tables by that column. Citus
automatically co-locates [2] data to support joins between different tables on
tenant_id. You can then use _full SQL_ and ACID transactions as long as you're
always filtering by one specific tenant, and a useful subset of SQL for
parallel queries across all tenants (incl. inner/outer joins). Citus is far
more capable than most databases in this area, partly because it can build on
the querying features of PostgreSQL, rather than having to implement them from
scratch.

The main advantage of using Citus over PostgreSQL in this case is that you can
scale out memory, CPU, and storage space to keep up with your read workload.
In addition, it provides parallel aggregations through INSERT..SELECT and
parallel indexing when performing a COPY, allowing you to keep up with bulk
write workloads. The main drawback is that you cannot perform arbitrary SQL
queries across all the data and some queries may require a different data
model.

The main advantage of using Citus over NoSQL in this case is that you can use
arbitrary single-tenant SQL queries and indexes, which makes data modelling
far simpler and thus cheaper. Being able to perform parallel cross-tenant
analytical queries and parallel aggregation [3] also avoids the need for
maintaining separate systems for analytics (again, cheaper). Bulk loading
through COPY can provide extremely high write throughput at relatively low
cost [4]. A drawback is that, to provide consistency, Citus has designated
primary nodes for taking writes. If a primary goes down there will be ~1
minute of down-time for the keys stored on that primary until the secondary
takes over (similar to e.g. Kafka). However, this is not a huge concern for
most applications, especially considering that PostgreSQL is very stable
software so crashes are usually only hardware related.

In a pure key-value storage use-case, the extra functionality provided by
Citus is not as beneficial as it is in the multi-tenant use-case, though
features like JSONB, GIN indexes, SQL functions and ACID transactions can
still be invaluable. In terms of short request throughput, regular Citus is
comparable to a single PostgreSQL node in being able to sustain around ~50k
writes/sec, which is less than typical NoSQL databases. However, if you need
more than 50k/s writes, then Citus MX will be able to scale out your write
workload horizontally [5].

For key-value use-cases, I would say Citus (MX) is competitive with NoSQL
databases, but not specifically better or worse, just making different trade-
offs. The big advantage is mainly that, if one day you decide you also need
analytics, or a new query, or a new filter, or a new conversion, then you
don't need to deploy a new system and probably won't have to remodel your
data.

[1] [https://www.citusdata.com/blog/2016/08/10/sharding-for-a-
mul...](https://www.citusdata.com/blog/2016/08/10/sharding-for-a-multi-tenant-
app-with-postgres/)

[2]
[https://www.citusdata.com/blog/2016/12/22/scaling_out_sql_wi...](https://www.citusdata.com/blog/2016/12/22/scaling_out_sql_with-
colocation/)

[3] [https://www.citusdata.com/blog/2016/11/29/event-
aggregation-...](https://www.citusdata.com/blog/2016/11/29/event-aggregation-
at-scale-with-postgresql/)

[4] [https://www.citusdata.com/blog/2016/06/15/copy-postgresql-
di...](https://www.citusdata.com/blog/2016/06/15/copy-postgresql-distributed-
tables/)

[5] [https://www.citusdata.com/blog/2016/09/22/announcing-
citus-m...](https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/)

