
Citus Unforks from PostgreSQL, Goes Open Source - jamesheroku
https://www.citusdata.com/blog/17-ozgun-erdogan/403-citus-unforks-postgresql-goes-open-source
======
no1youknowz
This is awesome. I have experience with running a CitusDB cluster and it
pretty much solved a lot of the scaling problems I was having at the time. For
it to go open source now, is of huge benefit to the future projects I have.

> With the release of newly open sourced Citus v5.0, pg_shard's codebase has
> been merged into Citus...

This is fantastic, sounds like the setup process is much simpler.

I wonder if they have introduced the Active/Active Master solution they were
working on? I know that before, there was 1 Master and multiple Worker nodes. The
solution before was to have a passive backup of the Master.

If say, they released the Active/Active Master later on this year. That's
huge. I can pretty much think of my DB solution as done at this point.

~~~
ozgune
(Ozgun from Citus Data)

We're working on making Citus masterless. In all openness, we evaluated two
different approaches to this in the past six months, and wrapped up the design
for one. This design works well on the cloud, and we already demonstrated a
working version:
[https://youtu.be/_nun2S6EdWo?t=411](https://youtu.be/_nun2S6EdWo?t=411)

For on-premise deployments, the primary challenge is set-up complexity. We're
now prototyping one of those designs to know more:
[https://github.com/citusdata/citus/issues/389](https://github.com/citusdata/citus/issues/389)

We expect to share all the details and a concrete timeline in April.

~~~
Florin_Andrei
Would it be possible (eventually) to use Citus for sharding within the
datacenter, and BDR for master/master replication between datacenters?

Or is Citus taking over the master/master replication? (or is it doing
something different?)

~~~
mslot
(Marco from Citus Data)

It seems at least theoretically possible.

Since the Citus master executes distributed queries by sending regular SQL
queries to the Citus workers, you could already use BDR servers as workers and
replicate the data between pairs of workers in different data centers and copy
over the metadata on the master manually. However, some distributed joins and
data loading features wouldn't work.

For all features to work, and to replicate the master, you would have to
compile Citus against BDR, which probably requires a few code changes.

~~~
Florin_Andrei
Postgres with sharding and master/master replication would be so awesome.

------
devit
I've been unable to find any clear description of the capabilities of Citus
and competing solutions (postgres-x2 seems the other leader).

Which of these are supported:

1\. Full PostgreSQL SQL language

2\. All isolation levels including Serializable (in the sense that they
actually provide the same guarantees as normal PostgreSQL)

3\. Never losing any committed data on sub-majority failures (i.e. synchronous
replication)

4\. Ability to automatically distribute the data (i.e. sharding)

5\. Ability to replicate the data instead or in addition to sharding

6\. Transactionally-correct read scalability

7\. Transactionally-correct write scalability where possible (i.e. multi-
master replication)

8\. Automatic configuration only requiring to specify some sort of "cluster
identifier" the node belongs to

~~~
ozgune
(Ozgun from Citus Data)

On PostgreSQL language support, we're updating our FAQ to have more
information:
[https://www.citusdata.com/frequently-asked-questions](https://www.citusdata.com/frequently-asked-questions)
Since the
PostgreSQL manual (and its feature set) spans over 4K+ pages, we found that
the best way to think about Citus' capabilities is from a use-case standpoint.
If your workload needs distributed transactions that span across machines, or
large ETL jobs, Citus currently isn't the best fit.

Citus supports sharding and replication out of the box (#4, #5). On #6, reads
go through a master node (metadata server) and you see what you write.

We don't have #7. The way in which we implement this also has implications on
your other questions. Multi-master (no single metadata server) is by far the
biggest feature request that we receive:
[https://news.ycombinator.com/item?id=11353866](https://news.ycombinator.com/item?id=11353866)

_If_ we go with the approach in
[https://github.com/citusdata/citus/issues/389](https://github.com/citusdata/citus/issues/389),
you will be able to configure #3, #6, #7 through PostgreSQL's streaming
replication settings. We still won't support distributed transactions that
span across multiple machines.

On #8, could you elaborate a bit more? Do you mean a logical identifier for
the node?

Also, it's hard to write a concise reply on a topic that requires so much
context. I'd love to grab coffee with anyone who's interested in diving deep
into distributed databases. Feel free to shoot me an email at
ozgun@citusdata.com

~~~
gorodetsky
Thanks for awesome product!

Do you know when you're planning to release Citus 5.0 deb/rpm packages?

~~~
jasonmp85
(Jason from Citus here)

As soon as they're built in PGDG! Our Docker image just builds on the
PostgreSQL 9.5.1 image, then installs a .deb we built.

I've been wrapping up all our packaging work during the past week, but not
having an OSS release yet was the final blocker for getting into well-known
repos. We'll probably have a post about this in the near future.

------
exhilaration
AGPL license if anyone's curious:
[https://github.com/citusdata/citus/blob/master/LICENSE](https://github.com/citusdata/citus/blob/master/LICENSE)

~~~
gtrubetskoy
Which means there is no chance this would ever become part of PostgreSQL
proper.

~~~
rch
That's correct, and with such a significant license change I think the term
'unfork' is being used inappropriately in the title.

Edit: the PostGIS extension is GPL, and that license choice has been very
successful. Hopefully the AGPL works out at least as well for Citus, I'm just
not familiar enough to know what the implications will be in this context.

~~~
gtrubetskoy
>> the PostGIS extension is GPL

But then it doesn't really belong or aim to be in PostgreSQL proper, at least
IMO, so I think that's fine.

The functionality (i.e. distributed processing and horizontal scaling) that
Citus has done is something that I predict will eventually be part of standard
PostgreSQL, but it will not be Citus's code (unless they change the license,
of course).

~~~
rch
Alternative implementations will probably surface, but those will almost
certainly be packaged as extensions as well.

I think the model _could_ work in this case, both for the business and for
PostgreSQL in general.

------
gtrubetskoy
If anyone from Citus is reading this: how does this affect your business
model? I remember when I asked at Strata conf a couple of years ago why isn't
your stuff Open Source, the answer then was "because revenue". So what changed
since then?

~~~
atonse
My hunch is that the two are not really related.

Companies of any appreciable size will be happy to pay for support if they
choose to make Citus a part of their critical infrastructure. And the industry
has reached an inflection point where enough companies want as much of their
infrastructure as possible to be open source that you can run a company where
most of your stuff is open source, while still making a ton of money (like
RedHat, CoreOS, Docker, etc)

~~~
SwellJoe
I know Red Hat is making a ton of money. But, CoreOS and Docker, are they at
the "making a ton of money" stage, or merely well-funded by investors?

~~~
atonse
Good point. I wondered if I should edit that specific part, but kept it
anyway.

But a sign of getting investors is also that they see that there's still
potential of making a lot of money, in spite of being open source.

That's true here too. Citus mentioned they spoke to their board (and
presumably their investors too) about this change.

------
TY
This is awesome! Tebrikler (congrats) on the release of 5.0 and going OS,
definitely great news.

Can you publish competitive positioning of Citus vs Actian Matrix (nee
ParAccel) and Vertica? I'd love to compare them side by side - even if it's
just from your point of view :-)

~~~
flavor8
...and Redshift. I love what Amazon provide, but it gets expensive.

~~~
umur
(Part 3/3 - please see two comments below as the starting point) Aside from
the different use-cases they address, there is one other, important difference
between Citus and Redshift (and any other distributed database in the world,
for that matter). Citus does _not_ fork the underlying database, PostgreSQL.
Instead, Citus extends PostgreSQL to transform it into a parallel processing,
distributed database. We use PostgreSQL's powerful extension APIs to
accomplish this (you simply CREATE EXTENSION Citus on PostgreSQL's latest
version, 9.5, to get your distributed PostgreSQL database).

While this might appear as an implementation piece at first, it has important
product implications, and might even impact how you might want to think about
your database stack. By not forking the core database, you are choosing to
always stay with the core PostgreSQL product. For starters, you get the uber-
cool (and uber-fast) JSONB type that came with 9.4, or the recently checked in
UPSERTs, or the popular PostGIS extension for geospatial capabilities. More
philosophically, the moment you use a fork of a database, you know you'll be
diverging over time. And when you introduce new databases and/or piece
together many different ones to build one application, your development cycles
will only get costlier and more complex over time.

This was a long answer to a short question, but hopefully useful. Let me know
if you have questions, or any feedback using Citus – would love to hear your
thoughts!

~~~
buremba
I think that it would be better for you to position CitusDB by comparing it to
other products in terms of use cases.

If the data is big and I need to run analytic queries then I think I have to
use a columnar storage format because row-oriented formats cause too much
overhead for aggregation queries that usually need to process single column
efficiently. If I use CitusDB as an analytical database, then it's comparable
with Redshift, Hive etc. As you said, they're suitable for offline data, but
can I use cstore_fdw in CitusDB and still take advantage of the real-time
nature of Postgresql? Maybe I can push hot data to a table that uses a
row-oriented format, move the data periodically to another table that uses
cstore_fdw, and execute queries that fetch data from both the cold storage and
hot storage tables? If CitusDB makes it easy for me, then I think this is huge.

I guess another use case is using CitusDB as distributed data store and
executing filter queries such as "SELECT * FROM table WHERE partition_key = x
and predicate1 = y ...". Instead of using multiple Postgresql instances and
routing the queries in application level, I can just use CitusDB that takes
care of replication && query routing && sharding etc. I think it can also be
comparable to databases such as Cassandra, Mongo (using jsonb) since they also
have similar use-cases.

Or should I think of CitusDB as distributed Postgresql?

~~~
mslot
(Marco from Citus Data)

> If I use CitusDB as an analytical database, then it's comparable with
> Redshift, Hive etc.

A particular difference is in response times and concurrency. Data warehouses
and Hive are great for reporting use-cases, but not for use-cases that require
fast responses and have many users like analytical dashboards. This is a use-
case for which Citus is particularly well-suited (see for example the
CloudFlare dashboard).

> Can I use cstore_fdw in CitusDB and still take advantage of the real-time
> nature of Postgresql?

Yes, since cstore_fdw and Citus are both developed by Citus Data we made sure
they're fully integrated. We've even seen some deployments that use a mixture
of columnar- and row-based storage in a single distributed table.

We find that row-based storage generally has better ingestion performance and
more indexing possibilities. Citus can do very fast execution of analytical
queries by parallelizing over row-based shards and using the indexes on each
of them. However, if you only need a small number of columns and have
analytical queries that are not very selective, you can use columnar storage
just as easily and even mix and match (might require some support).
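
A rough sketch of the hot/cold pattern asked about above, using Python's
sqlite3 purely as a stand-in. SQLite has no columnar storage, the table and
view names are invented for illustration, and this is not Citus or cstore_fdw
code. It only shows the shape of the idea: ingest into a hot table, move old
rows to a cold table periodically, and query both through one view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: events_hot stands in for row-based storage,
    -- events_cold stands in for a columnar (cstore_fdw-style) table.
    CREATE TABLE events_hot  (ts INTEGER, value INTEGER);
    CREATE TABLE events_cold (ts INTEGER, value INTEGER);
    CREATE VIEW events_all AS
        SELECT ts, value FROM events_hot
        UNION ALL
        SELECT ts, value FROM events_cold;
""")


def archive_older_than(conn, cutoff_ts):
    """Move rows older than cutoff_ts from hot to cold in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO events_cold SELECT ts, value FROM events_hot WHERE ts < ?",
            (cutoff_ts,),
        )
        conn.execute("DELETE FROM events_hot WHERE ts < ?", (cutoff_ts,))


# Ingest goes to the hot table only.
conn.executemany("INSERT INTO events_hot VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
conn.commit()

# Periodic job: everything before ts=3 becomes cold data.
archive_older_than(conn, 3)

# Analytical queries read hot and cold data through the combined view.
total = conn.execute("SELECT SUM(value) FROM events_all").fetchone()[0]
```

The cold table only ever receives appended batches here, which is the access
pattern a columnar store tends to favour.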

> I guess another use case is using CitusDB as distributed data store

Yep, Citus can definitely be used for that by using hash-partitioned tables.
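
To make the hash-partitioning idea concrete, here is a minimal sketch of how
a filter on the partition key could be routed to a single shard. The worker
names, shard count, table naming, and hash function are all assumptions made
up for the example, not Citus's actual metadata or hashing scheme.

```python
import zlib
from typing import Tuple

# Hypothetical cluster layout, for illustration only.
WORKERS = ["worker-1", "worker-2", "worker-3"]
SHARD_COUNT = 8


def shard_for(partition_key: str) -> int:
    """Map a partition key to a shard id using a stable hash."""
    return zlib.crc32(partition_key.encode("utf-8")) % SHARD_COUNT


def worker_for(shard_id: int) -> str:
    """Place shards on workers round-robin."""
    return WORKERS[shard_id % len(WORKERS)]


def route(partition_key: str) -> Tuple[str, str]:
    """Return (worker, per-shard table name) for a single-key query."""
    shard_id = shard_for(partition_key)
    return worker_for(shard_id), "events_%d" % shard_id


# A query like "... WHERE partition_key = 'user-42'" touches one shard only.
worker, shard_table = route("user-42")
```

Queries without a partition-key filter cannot be pruned this way and would
have to be sent to every shard.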

------
erikb
Unforking is a very smart decision. Postgres also has gained a lot of favour
since MySQL was bought by Oracle. Altogether Citus has earned a lot of kudos
for that move, at least with me, for all that may count!

------
faizshah
So this sounds similar to Pivotal's Greenplum which is also open source, can
anyone compare the two?

~~~
frn
Greenplum is based on postgres 8.2, with the featureset you'd expect from pg
8.2 - basically none of the additions after 2006 have merged to GP.

~~~
faizshah
Ok, and what's the process like for disaster recovery with citus?

~~~
craigkerstiens
That depends on your setup, for the master instance you'd run it just as you
would for other setups. Streaming replication is common there. For the sharded
instances, Citus has the ability for you to set what your replication factor
is. Here Citus is then aware of when a node fails and will automatically
redistribute the data to a new node, essentially taking care of that for you.
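
A toy sketch of the replication-factor idea described above, under stated
assumptions: the placement policy (round-robin), the node names, and the
"pick the least-loaded healthy node" rule are invented for illustration and
are not taken from Citus.

```python
from typing import Dict, List


def place_shards(shard_ids: List[int], nodes: List[str],
                 replication_factor: int) -> Dict[int, List[str]]:
    """Assign each shard to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds node count")
    return {
        shard: [nodes[(i + r) % len(nodes)] for r in range(replication_factor)]
        for i, shard in enumerate(shard_ids)
    }


def handle_failure(placements: Dict[int, List[str]], failed: str,
                   healthy: List[str]) -> Dict[int, List[str]]:
    """Replace every placement on the failed node with a healthy node."""
    def load(node: str) -> int:
        # Number of shards currently placed on this node.
        return sum(node in h for h in placements.values())

    for holders in placements.values():
        if failed not in holders:
            continue
        holders.remove(failed)
        # Never put two copies of the same shard on one node.
        candidates = [n for n in healthy if n not in holders]
        if candidates:
            holders.append(min(candidates, key=load))
    return placements


placements = place_shards([0, 1, 2, 3], ["a", "b", "c"], replication_factor=2)
handle_failure(placements, failed="a", healthy=["b", "c", "d"])
```

In a real system the replacement copy would also have to be filled from a
surviving replica; the sketch only tracks where copies live.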

------
voctor
Citus can parallelize SQL queries across a cluster and across multiple CPU
cores. How does it compare with the upcoming 9.6 version of PostgreSQL which
will support parallel sequential scans, parallel joins and parallel
aggregates?

~~~
lfittl
AFAIK all the parallel work done in 9.6 refers to parallel operations on a
single node (but multiple cores).

This would be complementary to what Citus does, which is distributing the load
across multiple shard instances (each with their own cores, benefiting from
the parallel work in 9.6).
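
As a rough illustration of distributing an aggregate across shards, the
sketch below computes partial results per shard in parallel and merges them.
The in-memory shard lists and function names are hypothetical; a real cluster
would send SQL to worker nodes instead of calling a local function.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the rows held by one worker.
SHARDS = [[3, 5, 8], [13, 21], [34, 55, 89, 144]]


def partial_aggregate(rows):
    """Per-shard work: return (sum, count) so the results can be merged."""
    return sum(rows), len(rows)


def distributed_avg(shards):
    """Scatter the partial aggregate to every shard, then gather and merge."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


average = distributed_avg(SHARDS)
```

An average has to be shipped as (sum, count) pairs because averaging the
per-shard averages would give the wrong answer when shards differ in size.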

~~~
voctor
Yes, but Citus can also parallelize on multiple cores when used on a single
machine ("If you’re running Citus on a single machine, this will scale queries
across multiple CPU cores and create the impression of sharding across
databases."). Will this functionality become obsolete with 9.6?

~~~
lfittl
Fair point - I assume this will be merged together in some way (i.e. the Citus
stuff building on top of the parallel scan infrastructure), but probably a
better question to ask on #citus IRC / open a Github issue.

Some Postgres committers work on Citus as well (e.g. Andres Freund), so I'm
sure this has been thought through before.

------
azinman2
I want it to be called citrus, which is what I always read it as....

~~~
jrochkind1
Then I'd get confused and think it had something to do with Citrix.

~~~
azinman2
A cute lemon logo might help :)

------
rkrzr
This is fantastic news! Postgres does not have a terribly strong High
Availability story so far and of course it also does not scale out vertically.
I have looked at CitusDB in the past, but was always put off by its closed-
source nature. Opening it up seems like a great move for them and for all
Postgres users. I can imagine that a very active open-source community will
develop around it.

~~~
tlarkworthy
nit: Postgres doesn't scale horizontally, it only scales vertically.

~~~
manigandham
It doesn't scale vertically either. Postgres is single-threaded meaning it
can't make use of multiple CPU cores on the same machine. There have been some
slow improvements to this and 9.6 seems to hint at some parallel aggregation
changes but overall Postgres is strong in features but weak in scaling (in any
direction).

~~~
anarazel
"hint at"?

~~~
manigandham
[https://news.ycombinator.com/item?id=11331602](https://news.ycombinator.com/item?id=11331602)

It looks good, will have to wait and see what actually makes it into 9.6 and
how well it supports all the possible queries. This is something that many
commercial databases have solved so Postgres is pretty behind in vertical
scaling.

~~~
anarazel
To me it's a bit weird calling committed patches "hints".

~~~
tlarkworthy
British English vs. American English? I dunno if the poster is British but I
am. I read that they haven't done enough to call pg vertically scalable but
recent work suggests (hints) that they are working toward it. The
expression is more metaphorical than literal, an Atlantic divide my American
colleagues explicitly warn me about when I am writing my perf reviews.

------
ccleve
I'd very much like to see what algorithm these systems are using to enable
transactions in a distributed environment. Are they just using straight two-
phase commit, and letting the whole transaction fail if a single server goes
down? Or are they getting fancy and doing some kind of replication with
consensus?

~~~
wmfiv
I believe transactions must process against a single node.

------
lobster_johnson
This is great!

One thing I'm having trouble with is finding information about transactional
semantics. If I make several updates (to differently sharded keys) in a single
transaction, will the transaction boundaries be preserved (committed "locally"
first, then replicated atomically to shards)? Or will they fan out to
different shards with separate begin/commit statements? Or without
transactional boundaries at all?

In fact, I can't really find any information on how CitusDB achieves its
transparent sharding for queries and writes. Does it add triggers to
distributed tables to rewrite inserts, updates and deletes? Or are tables
renamed and replaced with foreign tables? I wish the documentation was a bit
more extensive.

------
signalnine
Congrats from Agari! We've been looking forward to this and continue to get a
lot of value from both the product and the top-notch support.

------
jjawssd
My guess is that Citus is making enough money from consulting that they don't
need to keep this code closed source when they can profit from free community-
driven growth while they are expanding their sales pipeline through
consulting.

~~~
craigkerstiens
Hi, Craig from Citus here. In addition to the open source Citus, we have some
premium features in our Enterprise edition. Many of these are ones that larger
enterprises will want to pay for such as security features around roles, a
tool for automated cluster resizing, and enhanced load balancing tools, and of
course support. Beyond that we have a few other things in the work that will
speak to various revenue models for the future.

------
ahachete
Congratulations, Citus.

Since I heard last year at PgConfSV that you will be releasing CitusDB 5.0 as
open source, I've been waiting for this moment to come.

It augments 9.5's awesome capabilities with sharding and
distributed queries. While this targets real-time analytics and OLAP
scenarios, being an open source extension to 9.5 means that a whole lot of
users will benefit from this, even under more OLTP-like scenarios.

Now that Citus is open source, ToroDB will add a new CitusDB backend soon, to
scale-out the Citus way, rather than in a Mongo way :)

Keep up with the good work!

------
BinaryIdiot
I don't have a ton of experience scaling out and using different flavors of
PostgreSQL but I had run across Postgres-XL not long ago; does anyone know how
this compares to that?

------
ismail
Any thoughts on using something like postgres+citus vs hadoop+hbase+ecosystem
vs druid for olap/analytics with very large volumes of data?

------
X86BSD
AGPL? This is dead in the water :( It will never be integrated into PG. What a
shame. It should have been a 2 clause BSDL. Sigh.

~~~
tspiteri
The BSDL does not make much economic sense to the company open sourcing their
code; a new competitor would fork the code, make closed improvements, and
merge any changes from the open source code. That means that the competitor is
always gaining by a one-way flow of improvements.

To use open source code, the more permissive the license the better. But to
actually open your own code, BSDL is a very tough sell.

That's also why they use the AGPL. With database systems, even if they were
under the GPL, some competitors could just modify the system and run it on
their own server with improvements, and offer just the service to their
clients. Again, the improvements go one way only: since the competitor would
not distribute the modified system, as it's running on their servers, they
would not need to distribute source changes. With the AGPL, that loophole is
closed.

~~~
X86BSD
So you take a BSDL codebase, fork it, close it, make proprietary changes,
profiting from the BSDL codebase, then slap the PG community in the face by
open sourcing it under a more restrictive license hoping to benefit from the
community you just slapped in the face but restricting competition.

They are of course free to release their code under any license they wish. I
just think releasing code under the *GPL when you profited from a liberal BSDL
is a douche nozzle thing to do. But knock yourself out! This tells me all I
need to know about the company.

~~~
icebraining
I also have qualms about distributing changes to non-copyleft license code
under a copyleft license, but it seems strange to make this the "slap in the
face" moment - wasn't distributing them under a proprietary license even
worse?

~~~
X86BSD
No that IS what the BSDL allows for. I wasn't arguing that they shouldn't have
made a commercial product on BSDL code. That also was within their rights
using said BSDL code. It's just my _opinion_ that to do that, then release
your proprietary "bits" under the GPL or variation thereof with more
restrictions than what you started with instead of the same BSDL you used to
start your business is a dickheaded douche nozzle thing to do. But again they
are free to do that, it's within their rights! Just don't be shocked when
people like me call you out for what you are. Keep on rocking Citus! Stay
classy.

~~~
icebraining
I understand you weren't saying they don't have the right to do so, I just
find it weird that you consider distributing it under the AGPL to be douchy,
while not considering the distribution under a proprietary licence to be (even
more) douchy.

~~~
anonbanker
A relicense is a relicense. Imposing new rules on a product, especially
if you weren't the original author, is rude.

I'm a GPL zealot, to the point that I've used the GPL as a weapon and as a
shield against others in multiple capacities ("You own all my code when you
employ me? Sure, as long as I get to dictate the license"). However, I would
_never_ take someone's 2 or 3-clause BSD-licensed product, and relicense.
Those of us that value software freedom value the rights of other licenses that
believe the same.

We all see why it was done in this case, however; In order to ensure software
freedom in the cloud (someone else's computer), there isn't another license to
use. The BSD license completely breaks down in this use scenario, and the
_best we have_ is AGPL.

I'd say cloud usage is to 3-clause BSD what Tivo was to GPLv2.

------
satygeek
Does CitusDB fit OLAP analytical workloads that aggregate hundreds of
millions of records, using varying order and size of dimensions (e.g. Druid),
within a max response time of 3 seconds and using as few boxes as possible? Or
do other techniques have to be used along with CitusDB? Can you shed some light
on your experience with CloudFlare in terms of cluster size and query perf?

~~~
lfittl
I'll leave the initial questions to the Citus team, but re: CloudFlare this
link might be helpful:

[https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...](https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/)

~~~
satygeek
Thanks. I went through it but couldn't find info about their cluster size, data
size and query response times.

------
uberneo
Great product - It would be nice to have an Admin interface like RethinkDB
where you can clearly define your replication and Sharding settings. Any
documentation around how to do this from the command line?

------
albasha
I recently switched back to MariaDB because I didn't see a clear/easy path for
Postgres scalability in case the project I am working on takes off. I am under
the assumption there are at least two fairly simple approaches to scale MySQL:
master-master replication using Galera, and Aurora from AWS. What do you guys
think? Am I right in thinking MySQL is easier to scale given I want to spend
the least amount of time maintaining it?

------
Dowwie
would a natural evolutionary path for start ups be to emerge with postgresql
and grow to requiring citusdb?

------
onRoadAgain23
Having been burned before, I will never use an OS infrastructure project that
has enterprise features you need to pay for. They always try to move you to
paid and make the OSS version unpleasant to use over time, as soon as the bean
counters take over to milk you.

"For customers with large production deployments, we also offer an enterprise
edition that comes with additional functionality"

~~~
gdulli
Do you often make big decisions based on extrapolation from so few data
points?

I use a major open source system with enterprise features and support but
don't pay for any of those options. I've used it for 3 years and it's been
invaluable. No pressure to start paying for anything. Some of its premium
features have actually become free over that period. But I wouldn't decide
that all open source systems with premium features are safe based on that
experience.

~~~
onRoadAgain23
If this cost roughly a million dollars, then yes. Especially if you're
locked in like with a DB. I use nginx because even though it has this mode it
would be easy to replace with something else.

------
ioltas
Congrats to all for the release. That's a lot of work accomplished.

------
ksec
Does anyone know how Citus compares to Postgres-XL?

------
Someone
One must thank them for open sourcing this, and cannot blame them for using a
different license, but using a different license makes me think calling this
"unfork" is bending the truth a little bit.

~~~
takeda
Perhaps I'm missing something, but this is just an extension that works with
standard postgres, there are no code changes in postgres itself, so it doesn't
look like it ever was a fork.

~~~
jasonmp85
(Jason from Citus here)

Yes, that's what you're seeing _right now_ , but in the past Citus (used to be
"CitusDB") was a superset of the entire PostgreSQL codebase. During the lead-
up to the open source release, we removed the use of any static methods or
internal machinery and rewrote the installation process to use the PostgreSQL
CREATE EXTENSION command. Additionally, we moved all of pg_shard's DML
functionality into Citus to unify the product line.

So ultimately CitusDB _was_ a fork but is now entirely an extension.

~~~
Someone
Aha. Should have read the announcement better. Thanks for the explanation.

------
lambdafunc
Any benchmarks comparing CitusDB against Presto?

------
Dowwie
is it correct to compare citusdb with pipelinedb?

