
Introducing Citus Cloud - craigkerstiens
https://www.citusdata.com/blog/1773-craig-kerstiens/414-introducing-citus-cloud
======
brandur
> Data durability
>
> As a data service our first priority is keeping your data safe. We utilize
> WAL-E, the popular continuous archiving tool for Postgres.

A slight aside, but it's definitely worth looking at Postgres' WAL system [1],
which is very cool technology that recently got better in 9.4: it can now
stream a "logical" representation of the WAL [2] instead of the previous
format, which was largely only good for internal use. The WAL-E project [3]
that persists WAL to S3 was originally started for Heroku Postgres, but is now
in widespread use elsewhere, including the new Citus initiative.
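
A minimal sketch of what the logical decoding in [2] looks like in practice
(the slot name and the users table here are just placeholders):

    
    -- Requires wal_level = logical and max_replication_slots > 0 in
    -- postgresql.conf. Create a slot with the built-in test_decoding plugin.
    SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');
    
    -- Make a change somewhere...
    INSERT INTO users (name) VALUES ('alice');
    
    -- ...and read it back as row-level changes rather than opaque binary WAL.
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);
    
    -- Clean up.
    SELECT pg_drop_replication_slot('demo_slot');
    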

Anyway, I'm a little biased because this was bootstrapped by ex-colleagues,
but I'm excited to see an alternative to Redshift (not necessarily the target
competitor here, but one interesting technology that Citus could viably
replace) that allows for more consistent query performance and has all the
modern features of Postgres (i.e. as opposed to being permanently locked into
8.0.2 with Redshift).

[1] [http://www.postgresql.org/docs/current/static/wal-intro.html](http://www.postgresql.org/docs/current/static/wal-intro.html)

[2] [http://www.postgresql.org/docs/9.4/static/logicaldecoding-explanation.html](http://www.postgresql.org/docs/9.4/static/logicaldecoding-explanation.html)

[3] [https://github.com/wal-e/wal-e](https://github.com/wal-e/wal-e)

~~~
snuxoll
For on-site usage I'm pretty happy with pgbarman from 2ndQuadrant, no need for
S3/Azure/Swift (base backups are archived daily to our off-site storage).

------
pbarnes_1
You guys really need to do this on GCP where there's zero Postgres
competition.

Since GCP will be overtaking AWS in like 2-3 years (hah), good to get a head
start now. ;)

~~~
craigkerstiens
Thanks for the input, we're very much considering other infrastructure
providers to run on. All input and requests for specific providers help us
prioritize.

~~~
brandur
> Plans start at $1,800 a month, with the introductory plan including a total
> of 48 GB of memory for your database (one primary node and two compute
> nodes, each with 16 GB of RAM and two cores).

You may need something that's a little more cost accessible before it's widely
useful ;)

~~~
vbit
+1

Or you may want to offer a one-click solution that I can deploy on my own
resources now and later seamlessly move to your hosted solution when I need to
scale.

------
nzoschke
Really excited about this launch.

Citus starts on an unprecedented level. Postgres is already some of the best
software ever invented, and the team is continuing the tradition of adding the
best features to the engine.

I'd pick Citus over (almost%) any NoSQL solution at this point. Great data
guarantees while keeping decades of engineering, tooling and know how.

Then again, I'm biased, having worked with them for years, but the team behind
Citus Cloud is awesome. They understand both Postgres and how to build and
operate reliable services better than anyone out there.

Congrats on the launch, Citus!

% I'm using DynamoDB right now because I don't have big data needs. But I miss
developing against Postgres dearly

------
ddorian43
Since this is for OLAP (by the wording), meaning latency is not a big problem,
why not just go with dedicated servers on Hetzner/etc. and lower the cost by,
say, 3x (probably more)?

~~~
gtaylor
Because then you aren't on AWS. "Just going to Hetzner" means mostly losing
everything else that AWS offers. You can still use some services externally,
but the latency is higher and there are ingress/egress costs to contend with.

A reminder: AWS is not just a VM host. If you are using it this way, you are
throwing away money. You use AWS to delegate big chunks of your infrastructure
or systems to Amazon, thus saving on manpower (and thus, salaries). Hetzner is
just a bare metal provider. Apples to oranges.

There are sites on AWS that deal with staggering volumes of traffic while
running tiny infrastructure or ops teams. _That_ is the main value proposition
of AWS (and GCP and Azure), and why Citus is focusing on it.

~~~
ddorian43
...what else is Citus using from AWS in this case, except the virtual servers
+ S3?

For example, this is what Algolia does (search engine; keeps the index in RAM
for 99% of users).

~~~
gtaylor
That's an implementation detail that doesn't get at the root of _why_ they are
snuggling up to AWS. Their _customers_ aren't just using virtual servers + S3.
_That_ is the important tidbit to grasp.

If I'm a customer that is using RDS, ElastiCache, S3, and who knows what else,
I'm not just going to switch to Hetzner, because I'd take on responsibility
that I had previously delegated to AWS.

~~~
ddorian43
But the customer isn't switching. Hell, Citus could even do the import
themselves (by having servers in AWS read from your S3 and push to their own
servers at Hetzner, so you don't pay bandwidth prices). Just like people can
use Algolia without leaving AWS.

Unrelated: have you ever thought that the bandwidth pricing on the clouds is
fucking insane?

------
mattsoldo
It's great to see a lot of the good DNA from the Heroku Postgres team
reflected in this release. Congrats on the launch.

------
DevKoala
Does Citus keep its performance over tables with tens of billions of records?
Also, how fast is it for ad-hoc queries over data coming from streams
(Kafka/Kinesis) that has not been cached?

~~~
spathak
Sumedh from Citus Data here.

> Does Citus keep its performance over tables with tens of billions of
> records?

Citus essentially shards the data across machines and queries the shards in
parallel. You can thus scale out your cluster and CPU cores as you add more
data, and maintain performance.

> Also, how fast is it for ad-hoc queries over data coming from streams
> (Kafka/Kinesis) that has not been cached?

By 'cached', do you mean OS or database caching in-memory? Query performance
for on-disk data is as fast as you can get with regular PostgreSQL, since each
data node is essentially a PostgreSQL node, and each shard a regular
PostgreSQL table. Standard tuning like indexes and Postgres configuration
parameters will apply here.
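
As a rough sketch of what that looks like from SQL (assuming Citus's
create_distributed_table UDF; exact API names vary by version, and the table
here is made up):

    
    -- The coordinator splits this table into shards across the worker nodes;
    -- each shard is a plain PostgreSQL table.
    CREATE TABLE events (event_id bigint, user_id bigint, payload jsonb);
    SELECT create_distributed_table('events', 'user_id');
    
    -- Aggregations are fanned out to the shards and executed in parallel.
    SELECT user_id, count(*) FROM events GROUP BY user_id;
    
    -- Standard per-shard tuning still applies, e.g. indexes,
    -- which Citus propagates to every shard.
    CREATE INDEX ON events (user_id);
    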

~~~
Tharkun
Not every query is parallelizable. Maintaining performance is a lie. An
easy-to-grasp example is computing a median. And I mean an exact median, not
an approximation.
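
For instance, in stock Postgres (a minimal sketch over a hypothetical
measurements table):

    
    -- An exact median requires a global ordering of all rows, so it cannot
    -- be computed independently on each shard and then merged.
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY temperature) AS median
    FROM measurements;
    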

~~~
spathak
@Tharkun: You are right that not every query is immediately parallelizable.
Distinct counts are another example. In some cases the data can be
re-partitioned so we can calculate exact values and push down computation in
parallel. This may still provide better performance than a single large table,
so there are benefits to it. Ultimately there will be tradeoffs to moving to
an entirely distributed environment, but depending on the use case the value
may offset those.

------
rodionos
> If you're dealing with event or time series data—whether user messages,
> logs, web analytics, or connected device events—scaling out to both store
> and analyze terabytes of data becomes trivial with Citus Cloud

Speaking of time series use cases. PostgreSQL uses 30 to 100 bytes per
long/float tuple, depending on schema. It's hard to see how it can be
competitive with TSDBs that are often down to single-digit bytes per tuple
these days.

~~~
andrewstuart2
Are you referring to something different than the double precision floating
point on the types page [1]? Because according to the docs, it uses 8 bytes,
in compliance with IEEE 754.

[1] [http://www.postgresql.org/docs/9.5/static/datatype-numeric.html](http://www.postgresql.org/docs/9.5/static/datatype-numeric.html)

~~~
rodionos
Table schema, with indexes disabled:

    
    CREATE TABLE SensorData (
        Sensor INT NOT NULL,
        FOREIGN KEY (Sensor) REFERENCES Sensors(Id),
        Humidity FLOAT,
        Precipitation FLOAT,
        Temperature FLOAT,
        SampleTime TIMESTAMP(3) NOT NULL,
        PRIMARY KEY (Sensor, SampleTime)
    );
    

Size on disk: 80MB after 2.5 million inserts of
{sensor,temperature,humidity,precipitation}.

    
    SELECT pg_size_pretty(pg_total_relation_size('"measurements"'));

~~~
andrewstuart2
80MB * 1024 * 1024 / 2500000 rows = 33 bytes/row

Schema:

    
    
        Sensor: 4B
        Humidity: 4B
        Precipitation: 4B
        Temperature: 4B
        Timestamp: 8B
        Null fields bitmap: 1B (< 8 fields)
    
        Total: 4 * 4 + 8 + 1 = 25 Bytes.
    

Consider a few extra bytes per page for page headers, some per-row overhead I
missed, and the size of the b-tree index created for your primary key, and
that's about the size I'd expect. Without giving up performance, precision, or
something else, you're not going to see time series databases get much smaller
for the same schema.
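
If you want to check the per-row figure directly, a minimal sketch (assuming
the SensorData table from upthread):

    
    -- Average bytes per row including heap pages and all indexes:
    SELECT pg_total_relation_size('sensordata') / count(*) AS bytes_per_row
    FROM sensordata;
    
    -- Heap only, excluding indexes:
    SELECT pg_relation_size('sensordata') / count(*) AS heap_bytes_per_row
    FROM sensordata;
    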

~~~
rodionos
It seems that float(p) with p undefined is treated as double precision. I need
to double-check my numbers. The storage test was done against 9.4 with indexes
turned off. Adding a (Sensor, SampleTime DESC) index increases disk usage by
30%.

Not sure what you mean by precision or performance loss. TSDBs are way faster;
I think that's a rather established fact, because they're optimized for
numeric array ingestion.

~~~
andrewstuart2
I'm not an expert on time series databases, but the main thing that I can
imagine they do differently is data storage order. In SQL world, the physical
order of the data is simply determined by the clustered index. If you were to
want your data ordered by time, you'd make an index on a timestamp and cluster
by that index [1].

Then if you want to get the most recent N entries, your SQL server is limited
pretty much only by the disk read rate, because it's just reading pages in the
order they're already stored physically on disk, rather than seeking based on
a non-clustered index.
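
Concretely, a minimal sketch of that approach (reusing the SensorData example
from upthread):

    
    -- Physically rewrite the table in timestamp order.
    CREATE INDEX sensordata_time_idx ON sensordata (sampletime);
    CLUSTER sensordata USING sensordata_time_idx;
    
    -- The most recent N entries are now mostly sequential reads.
    SELECT * FROM sensordata ORDER BY sampletime DESC LIMIT 1000;
    
    -- Note: CLUSTER is a one-time rewrite; Postgres does not keep the table
    -- ordered as new rows arrive, so it has to be re-run periodically.
    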

So that really just leaves the possibility of specialized caching or archiving
logic as the primary benefit of a TSDB, as far as I can tell. To me, that's
not likely worth the added complexity or maintenance costs until I'm
processing a _lot_ of transactions.

For the most part, though, that's because I already have a _ton_ of SQL
experience. And of course, I could be totally off-base, and I'd want to
benchmark the two optimal solutions and see what the difference really is for
a given use case.

[1] [http://www.postgresql.org/docs/9.5/static/sql-cluster.html](http://www.postgresql.org/docs/9.5/static/sql-cluster.html)

------
theCricketer
Database noob here: Could someone please explain why this is different from
other approaches and why this is interesting? How does it compare to BigQuery
on GCP or Elasticsearch? Are those even the right comparisons?

Is this interesting because they use Postgres?

~~~
ddorian43
Elasticsearch can't do very complex queries like PostgreSQL can (it probably
needs something like Spark for that?).

Elasticsearch can't grow tables; you need to create a new index with more
shards and transfer the data.

BigQuery works only as a column store, while Citus works as a column store +
row store (column stores don't support update/delete). You can mix row-store
and column-store tables and query/join them together.

It is interesting because it's the latest Postgres; for example, you can have
HyperLogLog column types, which ~nobody else offers.

It is available as open source, which BigQuery isn't (Elastic is, though).
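
To make the HyperLogLog point concrete, a rough sketch with the
postgresql-hll extension (the table and column names here are made up):

    
    CREATE EXTENSION hll;
    
    -- One small HLL sketch per day approximates that day's distinct users.
    CREATE TABLE daily_uniques (day date PRIMARY KEY, users hll);
    
    INSERT INTO daily_uniques
    SELECT day, hll_add_agg(hll_hash_bigint(user_id))
    FROM page_views
    GROUP BY day;
    
    -- Approximate distinct users over a month, without rescanning raw data.
    SELECT round(hll_cardinality(hll_union_agg(users)))
    FROM daily_uniques
    WHERE day >= date '2016-05-01' AND day < date '2016-06-01';
    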

------
stuff4ben
Interesting, because I was having a conversation with a colleague today,
complaining about other teams at work that we depend on for Oracle services
and that can't do HA. Not sure if they don't know how, if they just don't
think it works, or if it's too pricey for them. But we have a vendor product
that needs a DB, and our users don't want any downtime. Curious if Citus can
fill that gap in the HA space.

------
esilverberg2
Could someone compare this to Redshift? Seems like they solve similar
problems...we use RS to monitor terabytes of time-series data.

~~~
craigkerstiens
In some ways we're similar to Redshift; in others we're very different. At the
core we both have Postgres as a foundation, though Redshift's roots in
Postgres are from some time ago. Citus works with the latest Postgres release
and is an extension that doesn't fork the underlying database. Both of us work
with large-scale data; from there we diverge a bit.

Redshift very much focuses on data warehousing, so commonly large batch loads
followed by complex queries and analysis that run over a longer period of
time, commonly minutes and up.

Citus focuses more on real-time analytics. We support real-time ingest so you
can insert directly into it, or of course bulk load as well. And on the query
side, because we have high parallelism, customers are typically doing
real-time analytics with queries that operate in the one-second-or-lower
range. An example is CloudFlare, a customer that uses us to power the
dashboard you see when you log in as a user.

------
netshade
Would be cool if this included cstore_fdw as well (
[https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw)
).

~~~
craigkerstiens
At this point during the beta we're not supporting cstore, but it's definitely
on our roadmap for the future.

------
kevindeasis
I'm still waiting for a Database as a Service that offers RethinkDB, has a
free tier/freemium model, and isn't insanely expensive.

~~~
gokulj
Compose ([http://compose.io](http://compose.io)) offers RethinkDB, and doesn't
look too expensive, but no free tier :-(.

------
presspot
Kick ass team. They will surely nail it.

------
lambdafunc
Any benchmarks against Presto?

------
swayvil
citrus cloud?

