
Citus 6.1 Released – Horizontally scale your Postgres database - gdb
https://www.citusdata.com/blog/2017/02/16/citus-61-released/
======
koolba
Are the commands in the examples the actual SQL commands for sharding and is
the default to have them in the public namespace?

Ex:

    
    
    CREATE TABLE states (...)
    -- distribute it to all workers
    SELECT create_reference_table('states');

    SELECT isolate_tenant_to_new_shard('table_name', tenant_id);
    

While isolate_tenant_to_new_shard() doesn't seem like it'd clash with
anything, create_reference_table() seems common enough to exist in someone's
code. Why not have this in a citus schema by default?

> In Citus 6.1 Vacuum is now distributed and run in parallel automatically
> across your cluster.

Did VACUUM prior to 6.1 require one node in the cluster to issue locks on
other nodes? If so, what for? I'm not intimately familiar with how the nodes in
the cluster communicate, but I would have figured each is acting as a
standalone DB and coordinating common updates via 2PC (which would mean VACUUM
was already distributed).

~~~
ozgune
Fair point on creating these functions in a Citus schema. The reason for this
is mostly historical. We started with pg_catalog, haven't run into issues yet,
and therefore didn't prioritize changing this [1].

On the VACUUM table side, when the user ran this command, we previously didn't
propagate vacuum to the related shards. With this release, we run this command
in parallel across worker nodes.
([https://github.com/citusdata/citus/issues/719](https://github.com/citusdata/citus/issues/719))

[1] Citus' user-defined functions aren't in the public schema, but rather in
the pg_catalog schema. If a user defined their own create_reference_table(),
that definition would override the Citus definition. You'd then need to call
the Citus function by fully qualifying its name or by changing your
search_path.

As a footnote, if a user defines a create_reference_table() function, it would
only override the Citus function if the two functions have the same signature:
create_reference_table(table_name regclass).
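
For illustration, the two workarounds mentioned above might look roughly like
this. It is only a sketch: it assumes the Citus functions live in pg_catalog as
described, and it reuses the `states` table from the question.

```sql
-- Option 1: schema-qualify the call so the function in pg_catalog is
-- picked regardless of what else is on the search_path.
SELECT pg_catalog.create_reference_table('states');

-- Option 2: inspect the search_path and, for the current session only,
-- list pg_catalog explicitly ahead of the schema that holds your own
-- function (assumed here to be public).
SHOW search_path;
SET search_path TO pg_catalog, public;
SELECT create_reference_table('states');
```

Schema-qualifying the call is the less invasive of the two, since it does not
change how any other name in the session is resolved.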

(edited)

------
crudbug
I just started learning Citus; it is a very promising product.

A couple of things for improvement (they may already be supported):

1\. New User records do not propagate to all the nodes - Manual Step.

2\. New Database records do not propagate to all the nodes - Manual Step.

3\. Materialized View with incremental refresh - Currently Postgres re-runs
the query every time there is a data change - For a table of a billion records
this is very inefficient.
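
To make point 3 concrete, here is a minimal sketch in plain Postgres. The
table and view names are hypothetical. In stock Postgres a materialized view
is brought up to date with REFRESH MATERIALIZED VIEW, which recomputes the
full query rather than applying only the changed rows.

```sql
-- Hypothetical table and view names, for illustration only.
CREATE TABLE page_views (
    site_id   int         NOT NULL,
    viewed_at timestamptz NOT NULL
);

CREATE MATERIALIZED VIEW daily_page_views AS
SELECT site_id,
       date_trunc('day', viewed_at) AS day,
       count(*) AS views
FROM page_views
GROUP BY site_id, date_trunc('day', viewed_at);

-- Each refresh re-executes the whole SELECT over page_views,
-- even if only a handful of rows changed since the last refresh.
REFRESH MATERIALIZED VIEW daily_page_views;
```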

~~~
anarazel
> 1\. New User records do not propagate to all the nodes - Manual Step.
>
> 2\. New Database records do not propagate to all the nodes - Manual Step.

The issue here is that databases, users, and some other objects aren't
"database local" but "cluster wide" objects (i.e. visible in all the databases
of a PostgreSQL installation). As the Citus extension isn't necessarily
created in all of them, we can't reliably do anything about this...
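
A small sketch of what that means in practice. The role name is hypothetical,
and the manual step shown is just the plain Postgres statement repeated per
node, not a Citus feature.

```sql
-- Roles and databases live in shared catalogs, so these queries return
-- the same rows no matter which database of the installation you are
-- connected to.
SELECT rolname FROM pg_roles;
SELECT datname FROM pg_database;

-- Hypothetical role name. Because this is not propagated, the same
-- statement has to be run by hand on the master and on every worker node.
CREATE ROLE app_user WITH LOGIN;
```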

------
tpetry
Automatic failover for master nodes would be a really nice feature, making
Citus a hassle-free solution.

~~~
craigkerstiens
Hi, Citus Cloud, our managed service, does have automatic failover available
for both the master and the distributed nodes.

~~~
tpetry
I know, but I was speaking about the community or enterprise edition. Managed
services are sometimes not allowed for legal reasons (data privacy).

------
rattray
Can any users of Citus comment on the difficulty/ease of adoption, tradeoffs,
and benefits?

~~~
rattray
Another way of asking: Citus seems amazing. What's the worst part of using it
(other than cost)?

~~~
agentgt
I only have about a month of experience, but the real issue is that you have
to do a lot of things manually that are done automatically in other systems
like Cassandra.

Particularly rollups and stream-like things. That is, you have to do the
rollups (as well as stream-like things) yourself, through Postgres rules,
triggers, Listen/Notify, cron, etc. With systems like Druid (Cassandra),
Elastic Search and other time series databases this is not needed (either the
querying is efficient enough because of compression, or the data is
automatically indexed, or both).
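
As a rough idea of what "doing the rollups manually" can look like, here is a
minimal sketch in plain Postgres (9.5 or later for ON CONFLICT). The table and
column names are hypothetical, and whether the INSERT ... SELECT runs unchanged
against distributed tables depends on the Citus version and on how the tables
are distributed, so treat it as an assumption rather than a recipe.

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE events (
    tenant_id  int         NOT NULL,
    created_at timestamptz NOT NULL
);

CREATE TABLE events_hourly (
    tenant_id   int         NOT NULL,
    hour        timestamptz NOT NULL,
    event_count bigint      NOT NULL,
    PRIMARY KEY (tenant_id, hour)
);

-- Run periodically (e.g. from cron): recompute the previous and the
-- current hour, and upsert the counts into the rollup table.
INSERT INTO events_hourly (tenant_id, hour, event_count)
SELECT tenant_id, date_trunc('hour', created_at), count(*)
FROM events
WHERE created_at >= date_trunc('hour', now()) - interval '1 hour'
GROUP BY tenant_id, date_trunc('hour', created_at)
ON CONFLICT (tenant_id, hour)
DO UPDATE SET event_count = EXCLUDED.event_count;
```

Recomputing a short trailing window keeps the job idempotent, at the cost of
missing events that arrive later than the window covers.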

For the stream-like stuff we have been using Pipelinedb. Pipelinedb is
annoying, though, in that it is a fork and not an extension. Pipelinedb has a
similar problem: you have to create the streams a priori.

So for both Citus and Pipelinedb you have to do some planning of your schema,
whereas with the other guys you can do more exploratory/ad hoc analysis and
queries. The disadvantage of the NoSQL guys (besides maturity and literally no
SQL) is that you pay a lot more in memory, as most of them (IMO) cheat and rely
heavily on available memory. Postgres is very memory friendly.

~~~
rattray
Hmm, interesting. Sounds like a problem with Postgres, not Citus, but it
still sounds like a disadvantage compared to competing products (RethinkDB
also comes to mind).

I haven't implemented anything for that before personally; would a tool like
Bottled Water [0] be helpful for this usecase?

[0] [https://github.com/confluentinc/bottledwater-pg](https://github.com/confluentinc/bottledwater-pg)

~~~
agentgt
Yes, Postgres has disadvantages (I accidentally posted my comment too soon, so
that might have caused some confusion), but like I said, it is: 1. Mature, 2.
SQL is damn powerful, 3. Memory friendly (aka cost).

RethinkDB is cool. I just never got into it enough to know how well it
actually works. As for Bottled Water, we use Postgres + AMQP (aka Rabbit) +
Listen/Notify. I'm not really a big fan of Kafka, even though it does have
amazing scaling ability.

------
matt_wulfeck
> _Microservices and NoSQL get a lot of hype, but in many cases what you
> really want is a relational database that simply works, and can easily scale
> as your application data grows._

I believe this product is very threatened right now by Google's newly
announced Cloud Spanner database[1]. Even though services like this make
scaling "easy", they don't yet make it transparent like Google is striving to
do.

[1] [https://cloudplatform.googleblog.com/2017/02/introducing-Clo...](https://cloudplatform.googleblog.com/2017/02/introducing-Cloud-Spanner-a-global-database-service-for-mission-critical-applications.html)

~~~
craigkerstiens
Craig here from Citus. Spanner is absolutely some interesting technology, but
it focuses on some pretty different use cases than the ones we're looking to
solve. Spanner is globally distributed, but the cost of that is likely to be
100ms per operation. Further, it uses SQL for reads, but inserts and updates
go through RPC, so it's a bit of a different interface.

Citus is much more focused on interacting like a relational database that
simply scales out as opposed to re-inventing a lot of the underlying tech and
interfaces. You can find the comment further down from one of our founders,
but we're very focused on a few use cases.

1\. Scaling beyond single-node Postgres (this is primarily for B2B
applications currently)

2\. Operational analytics. Here, because of the way we parallelize the
workload, you can get sub-second query performance across terabytes of data.

3\. NoSQL++. This is a bit more akin to where Spanner may fit, where you have
exceptionally high read and write throughput requirements. (Exceptionally high
meaning north of 100k writes per second).

