

Scaling Out PostgreSQL for CloudFlare Analytics Using CitusDB (YC S11) - jgrahamc
https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/

======
gane5h
Really cool write-up – thanks! First time I’m hearing about CitusDB. They
appear to be building a columnar, distributed database while preserving the
Postgres frontend (similar to Redshift, Aster, Greenplum, etc.)

It’s all in the details. I’m planning to investigate the following during my
next weekend hack. Hope somebody can answer some pre-sales questions for me:

    
    
      - how complete is the postgres functionality (e.g.: lateral joins)
      - can you set a sharding key to control the shard distribution
      - does the database do multiple passes for queries with subselects
      - usually one increases the replication factor (limited by budget) to improve query times, with the limitation that it slows down loading time. does the DB stage intermediate writes to batch them, or does the user need to do this? batching works really well for append-only, timestamped event data.
      - do you have a job manager or scheduler, needed when you have multiple views that need to be updated without melting your infrastructure
      - how easy is it to operate? does the database expose operational metrics so that you can see the load on each shard to potentially detect unbalanced shards?
      - tips on hardware configuration (big advantage of redshift here is that you don’t have to run your own warehouse.) maybe partner with MongoHQ?
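
On the write-batching question above: if the database does not stage writes itself, the client-side version is small. Here is a minimal sketch in Python, assuming a hypothetical `sink` callable that performs one bulk load per batch (for example a single multi-row insert). The class name, the `max_batch` parameter, and the event shape are all illustrative assumptions, not anything from CitusDB.

```python
from typing import Callable, List, Tuple

# Hypothetical event shape: (timestamp, payload).
Event = Tuple[float, str]


class BatchingWriter:
    """Buffers append-only events and hands them to `sink` in batches."""

    def __init__(self, sink: Callable[[List[Event]], None], max_batch: int = 1000) -> None:
        if max_batch < 1:
            raise ValueError("max_batch must be >= 1")
        self._sink = sink            # assumed to do one bulk write per call
        self._max_batch = max_batch
        self._buffer: List[Event] = []

    def append(self, event: Event) -> None:
        """Stage one event; flush once the buffer reaches max_batch."""
        self._buffer.append(event)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single batch; no-op if empty."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._sink(batch)
```

Callers would still need to call `flush()` on shutdown (or on a timer) so a partially filled buffer is not left behind; this sketch only flushes on size.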
    

It’d be nice to see some sample query plans visualized graphically.

------
jaytaylor
This was a great read.

Question/thoughts regarding CitusDB:

It looks to be really cool, and also proprietary.

On the heels of the recent abrupt FoundationDB shutdown after being acquired
by Apple, I'm apprehensive and reluctant to even consider investing more
energy into proprietary datastores.

I'm torn because I love shiny future tech from outer space, but the FDB burn
felt horrible to me.

I'm keen to hear other perspectives that might help me find a better balance
or attitude on these matters.

~~~
EdwardDiego
I can't think of any columnar DBs that are FOSS aside from Cloudera's Impala
(under the Apache licence), and IMO while it presents itself as a DB, it's a
sometimes leaky abstraction over the Hadoop ecosystem, and the columnar
format it uses, Parquet, has some data-type limitations compared to other
products.

But I agree - we've had some "fun" with our columnar DB's vendor, and the
support we're paying so much for has been rather useless.

~~~
teraflop
Impala is cool but it's a data warehouse, not a full-featured DBMS. The
biggest difference is that it only supports batch inserts, so forget about
UPDATE/DELETE queries.

In any case, CitusDB's home page doesn't say anything about it being a
columnar database, and this blog post says it uses the same storage engine as
PostgreSQL.

~~~
EdwardDiego
> Impala is cool but it's a data warehouse, not a full-featured DBMS.

I wouldn't even call it that to be honest, you spend a lot of time thinking
about Hadoop and HDFS files when working with Impala.

> In any case, CitusDB's home page doesn't say anything about it being a
> columnar database

It does, if you dig deep enough. :)
[https://www.citusdata.com/citus-products/cstore-fdw](https://www.citusdata.com/citus-products/cstore-fdw)

But you're right that it's completely optional and not needed to access their
distributed query processing.

